This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Human vs. AI: A comparative effectiveness study of large language models for automated biomarker extraction.

2025 · 0 citations · Journal of Clinical Oncology

Abstract

e13605 Background: Unstructured data extraction is essential in oncology, with use cases ranging from individual patient care to system-wide applications such as cancer registry reporting. However, it remains a semi-manual, time-intensive process that lacks real-time integration. Large language models (LLMs) have demonstrated remarkable natural language processing capabilities but are prone to inconsistencies and hallucinations, highlighting the need for refinement. As a proof-of-concept study, we optimized a prompt-engineered LLM to extract p16 and HPV biomarker data from unstructured pathology reports in a head & neck oncology population and evaluated its performance, efficiency, and cost-effectiveness against human extractors.

Methods: We analyzed deidentified pathology reports from unique patients with head & neck cancer at Memorial Sloan Kettering Cancer Center. A prompt-engineered LLM, powered by Azure OpenAI GPT-4, a HIPAA-compliant, sandboxed LLM, extracted p16 and HPV statuses ("Positive," "Negative," or "Unknown") using zero-shot prompting. Prompts were optimized using 5% of the reports, and the final model was applied to the remaining 95%. Model outputs were validated against a gold-standard dataset annotated manually through independent human review. Performance was evaluated using precision, recall, and F1-score. Efficiency was measured by comparing an estimated processing time of 3 seconds per report with 1 minute per report for manual extraction. Cost-effectiveness was determined by comparing the LLM's processing cost of $0.005/report (average 1022 tokens) with manual extraction at a rate of $20/hour, along with a projection to 4,000 reports. The MSK IRB approved the study.

Results: Among 284 pathology reports, 132 (46%) included biomarker data: 80 (28%) were p16+, 52 (18%) p16-, 66 (23%) HPV+, and 43 (15%) HPV-. The prompt-engineered LLM extracted p16 status with 99.6% precision and 99.6% recall (F1-score 0.996) and HPV status with 98.0% precision and 98.2% recall (F1-score 0.981). Processing time was estimated at 852 seconds (14.2 minutes). Compared with manual extraction, the LLM was estimated to save 4.5 person-hours and $88.5 for 284 reports, with projected savings of 63.3 hours and $1,247 for 4,000 reports.

Conclusions: This proof-of-concept study demonstrates that a prompt-engineered LLM can accurately extract p16 and HPV biomarker data from unstructured pathology reports with high performance, significantly reducing manual processing time and costs. By optimizing prompt design to mitigate inconsistencies, we enhance the reliability of LLM-driven extraction, highlighting its potential as a real-time, scalable solution for unstructured data processing. Future research is needed to explore the applicability of LLMs to extracting other biomarkers, with further optimization of multi-modal integration and clinical implementation to enhance scalability and real-world impact.
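The efficiency and cost figures in the Results can be reproduced from the per-report numbers given in the Methods. The sketch below is not the authors' code. It assumes that net dollar savings equal the person-hours saved multiplied by the hourly rate, minus the LLM processing cost; the abstract does not state this formula, but it yields the reported values. The function names `estimate_savings` and `f1_score` are illustrative only.

```python
# Per-report figures quoted in the abstract's Methods section.
SECONDS_PER_REPORT_LLM = 3       # estimated LLM processing time
SECONDS_PER_REPORT_MANUAL = 60   # manual extraction, 1 minute
LLM_COST_PER_REPORT = 0.005      # USD per report
MANUAL_RATE_PER_HOUR = 20.0      # USD per hour


def estimate_savings(n_reports: int) -> dict:
    """Estimate time and cost savings of LLM extraction over manual extraction."""
    llm_seconds = n_reports * SECONDS_PER_REPORT_LLM
    manual_seconds = n_reports * SECONDS_PER_REPORT_MANUAL
    hours_saved = (manual_seconds - llm_seconds) / 3600
    # Assumption (inferred, not stated in the abstract): net savings are the
    # labor value of the hours saved minus the LLM processing cost.
    dollars_saved = (hours_saved * MANUAL_RATE_PER_HOUR
                     - n_reports * LLM_COST_PER_REPORT)
    return {
        "llm_seconds": llm_seconds,
        "hours_saved": hours_saved,
        "dollars_saved": dollars_saved,
    }


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for n in (284, 4000):
        s = estimate_savings(n)
        print(f"{n} reports: {s['llm_seconds']} s LLM time, "
              f"{s['hours_saved']:.1f} h saved, ${s['dollars_saved']:.2f} saved")
    print(f"p16 F1: {f1_score(0.996, 0.996):.3f}")
    print(f"HPV F1: {f1_score(0.980, 0.982):.3f}")
```

Under these assumptions, 284 reports give 852 seconds of LLM time, about 4.5 hours saved, and about $88.51 saved, while 4,000 reports give about 63.3 hours and $1,247. The F1 values computed from the reported precision and recall round to 0.996 and 0.981, consistent with the abstract.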

Topics

Biomedical Text Mining and Ontologies · Topic Modeling · Artificial Intelligence in Healthcare and Education
