This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Abstract 4372933: Hybrid NLP Model Accurately Extracts Data from Tetralogy of Fallot Cardiac MR Reports
Citations: 0
Authors: 7
Year: 2025
Abstract
Introduction: Utilization of large language models (LLMs) for named entity recognition from free-text medical reports is rapidly expanding. However, concerns about protected health information (PHI) restrict the adoption of commercial LLMs. While simpler natural language processing methods, such as regular expressions (RegEx), offer an accessible solution, they often fail when linguistic variability increases. Particularly in rare conditions with limited labeled datasets, advanced machine learning models like Bidirectional Encoder Representations from Transformers (BERT) are difficult to train. Thus, there is a need for secure, efficient, and accurate data extraction methods.
Research Questions: We hypothesized that a hybrid approach combining simple RegEx with few-shot prompts on an on-premises LLM would maximize accuracy and efficiency while maintaining PHI compliance.
Methods: We retrospectively analyzed cardiovascular magnetic resonance (CMR) reports from 183 patients. Custom RegEx rules and few-shot LLM prompts were independently applied across all reports. A hybrid extraction approach integrated both methods by selectively using LLM results in areas of poor RegEx performance. Ground truths were manually verified by a clinical expert. Performance was evaluated using Coverage, Precision, Recall, and F1-score metrics.
Results: A manual review of 430 CMR reports (3/2005-12/2024) identified a median proportion of missing values of 3.95% (IQR 2.79–5.12) across 13 clinical metrics. The baseline RegEx extraction alone achieved a completeness of 90.7%, whereas the standalone few-shot LLM approach reached 91.9%. Combining RegEx with targeted few-shot LLM prompts, the hybrid method significantly improved data completeness to 99.8%. In terms of accuracy, the hybrid approach attained an F1 score of 97.5% ± 3.6, clearly outperforming RegEx alone (85.2% ± 22.2) and the standalone LLM (86.0% ± 15.1). Pairwise comparisons confirmed differences were significant (p < 0.001) with large effect sizes (Cohen’s d > 1.0). Additionally, the hybrid approach reduced computational time by approximately 75% compared to the LLM-only method.
Conclusion: A hybrid NLP method combining deterministic RegEx and targeted LLM prompts significantly enhances data extraction accuracy from legacy clinical free-text reports. This approach addresses PHI security concerns and effectively reallocates annotation resources toward predictive modeling, thereby advancing clinical research and quality improvement.
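The hybrid idea described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the 13 clinical metrics, the RegEx rules, the prompts, or the exact metric definitions. The field names, the patterns, the `llm_extract` callable, the `llm_fields` selection, and the scoring rules below are all assumptions made for illustration. The sketch runs RegEx on every field and calls the LLM only for fields flagged as weak for RegEx, which is one plausible reading of "selectively using LLM results".

```python
import re
from typing import Callable, Dict, Iterable, Optional

Extraction = Dict[str, Optional[float]]

# Hypothetical fields and patterns; the abstract does not specify the real rules.
PATTERNS = {
    "lvef_percent": re.compile(
        r"LV\s*EF\D{0,15}?(\d{1,3}(?:\.\d+)?)\s*%", re.IGNORECASE
    ),
    "rvedvi_ml_m2": re.compile(
        r"RV\s*EDVi\D{0,15}?(\d{1,3}(?:\.\d+)?)", re.IGNORECASE
    ),
}


def regex_extract(report: str) -> Extraction:
    """Apply every pattern; a field with no match is recorded as None."""
    result: Extraction = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(report)
        result[field] = float(match.group(1)) if match else None
    return result


def hybrid_extract(
    report: str,
    llm_extract: Callable[[str, str], Optional[float]],
    llm_fields: Iterable[str],
) -> Extraction:
    """Use RegEx by default and the LLM only for the fields in llm_fields.

    llm_fields is assumed to hold the fields where RegEx performed poorly
    on a validation set; llm_extract stands in for a few-shot prompt sent
    to an on-premises model (hypothetical interface).
    """
    result = regex_extract(report)
    for field in llm_fields:
        result[field] = llm_extract(report, field)
    return result


def score(pred: Extraction, truth: Extraction) -> Dict[str, float]:
    """Coverage, precision, recall, and F1 under assumed definitions.

    coverage  = extracted values / all fields
    precision = correct values / extracted values
    recall    = correct values / values present in the ground truth
    """
    extracted = sum(v is not None for v in pred.values())
    present = sum(v is not None for v in truth.values())
    correct = sum(v is not None and v == truth.get(k) for k, v in pred.items())
    coverage = extracted / len(pred) if pred else 0.0
    precision = correct / extracted if extracted else 0.0
    recall = correct / present if present else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"coverage": coverage, "precision": precision, "recall": recall, "f1": f1}
```

In this sketch the LLM is invoked only for the flagged fields rather than for every field of every report, which is consistent in spirit with the reported reduction in computational time relative to an LLM-only pipeline, although the abstract does not state how that saving was achieved.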