This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Screening oncology articles in a qualitative literature review using large language models: A comparison of GPT4 versus fine-tuned open source models using expert-annotated data.
Citations: 1
Authors: 6
Year: 2024
Abstract
e23196 Background: Clinical Outcome Assessment (COA) conceptual gap analyses for oncology are complex and time consuming. Artificial intelligence may efficiently reduce the time to complete such analyses. We aimed to assess the performance of two AI models for literature screening to identify relevant qualitative oncology research, and we compared accuracy and run-time for both models.
Methods: We manually curated a dataset of title/abstract screening decisions (n = 1,700 study references) across 17 landscape reviews. Of these, 11 landscape reviews (n = 951 study references) were in oncology, covering 8 solid cancers (breast, lung, urothelial, colorectal, esophageal, head and neck, pancreatic, and stomach) and 3 non-solid cancers (lymphoma, acute myeloid leukemia, and multiple myeloma). Each citation was annotated for eligibility (Y/N) by population, study design (qualitative), and reporting of concepts (how patients feel or function). We then compared the accuracy of two AI models at predicting the screening decisions of expert researchers: Generative Pre-trained Transformer 4 (GPT4, OpenAI) prompts and a fine-tuned SciFive biomedical large language model (LLM). We used 70% of the data for training and 30% for testing. Accuracy estimates were obtained only for the models' ability to label eligibility within the 11 oncology datasets.
Results: Both LLMs performed well at assessing relevance by oncology population, with F1-scores of 0.92 for GPT4 and 0.83 for SciFive (precision 0.92 and 0.93, respectively). For concept reporting, the fine-tuned SciFive model outperformed GPT4, with an F1-score and precision of 0.88 and 0.92 versus 0.81 and 0.79. The same held, though less pronounced, for eligibility by study design, with an F1-score and precision of 0.81 and 0.90 versus 0.86 and 0.76. For overall eligibility, the customized SciFive model outperformed the GPT4 model with an F1-score and precision of 0.84 and 0.92 versus 0.85 and 0.82. Lastly, the GPT4 prompts took 10 to 30 minutes to screen 100 abstracts; by contrast, the customized SciFive model took 1 to 2 minutes on a computer with a Quadro RTX 8000 GPU.
Conclusions: Both AI models are promising. The fine-tuned SciFive model appears slightly more accurate and is substantially faster than the GPT4 model.
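The abstract gives no implementation details, but the described setup (a T5-based SciFive checkpoint fine-tuned on expert Y/N eligibility labels with a 70/30 train/test split) can be sketched with the Hugging Face transformers and datasets libraries. The following is a minimal illustrative sketch, not the authors' code: the checkpoint name razent/SciFive-base-Pubmed, the prompt format, the toy records, and all hyperparameters are assumptions.

```python
# Minimal sketch (not the authors' code) of fine-tuning a SciFive checkpoint
# for Y/N title/abstract eligibility screening. Checkpoint name, prompt
# format, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL = "razent/SciFive-base-Pubmed"  # assumed public SciFive checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Toy records standing in for the expert-annotated citations (Y/N labels).
records = [
    {"text": "screen: <title> Qualitative study of fatigue in breast cancer "
             "<abstract> ...", "label": "Y"},
    {"text": "screen: <title> Phase III trial of chemotherapy dosing "
             "<abstract> ...", "label": "N"},
]

def preprocess(batch):
    # Encode the screening prompt; the Y/N label becomes the target sequence.
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(batch["label"], truncation=True,
                              max_length=2).input_ids
    return enc

# 70/30 train/test split, mirroring the abstract.
splits = Dataset.from_list(records).train_test_split(test_size=0.3)
splits = splits.map(preprocess, batched=True,
                    remove_columns=["text", "label"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="scifive-screening",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Predicted Y/N strings on the held-out 30% could then be scored against the expert annotations with standard metrics (e.g. sklearn.metrics.precision_score and f1_score), which correspond to the precision and F1 figures reported in the Results.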
Similar works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,895 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,571 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,766 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,107 citations