This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Abstract 4372933: Hybrid NLP Model Accurately Extracts Data from Tetralogy of Fallot Cardiac MR Reports

2025 · 0 citations · Circulation

Abstract

Introduction: Utilization of large language models (LLMs) for named entity recognition from free-text medical reports is rapidly expanding. However, concerns about protected health information (PHI) restrict the adoption of commercial LLMs. While simpler natural language processing methods, such as regular expressions (RegEx), offer an accessible solution, they often fail when linguistic variability increases. Particularly in rare conditions with limited labeled datasets, advanced machine learning models like Bidirectional Encoder Representations from Transformers (BERT) are difficult to train. Thus, there is a need for secure, efficient, and accurate data extraction methods.

Research Questions: We hypothesized that a hybrid approach combining simple RegEx with few-shot prompts on an on-premises LLM would maximize accuracy and efficiency while maintaining PHI compliance.

Methods: We retrospectively analyzed cardiovascular magnetic resonance (CMR) reports from 183 patients. Custom RegEx rules and few-shot LLM prompts were independently applied across all reports. A hybrid extraction approach integrated both methods by selectively using LLM results in areas of poor RegEx performance. Ground truths were manually verified by a clinical expert. Performance was evaluated using Coverage, Precision, Recall, and F1-score metrics.

Results: A manual review of 430 CMR reports (3/2005-12/2024) identified a median proportion of missing values of 3.95% (IQR 2.79–5.12) across 13 clinical metrics. The baseline RegEx extraction alone achieved a completeness of 90.7%, whereas the standalone few-shot LLM approach reached 91.9%. Combining RegEx with targeted few-shot LLM prompts, the hybrid method significantly improved data completeness to 99.8%. In terms of accuracy, the hybrid approach attained an F1 score of 97.5%±3.6, clearly outperforming RegEx alone (85.2%±22.2) and the standalone LLM (86.0%±15.1). Pairwise comparisons confirmed differences were significant (p<0.001) with large effect sizes (Cohen's d > 1.0). Additionally, the hybrid approach reduced computational time by approximately 75% compared to the LLM-only method.

Conclusion: A hybrid NLP method combining deterministic RegEx and targeted LLM prompts significantly enhances data extraction accuracy from legacy clinical free-text reports. This approach addresses PHI security concerns and effectively reallocates annotation resources toward predictive modeling, thereby advancing clinical research and quality improvement.
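The hybrid idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the 13 metrics, the RegEx rules, or the prompts, so the field names (`rv_edv_ml`, `rv_ef_pct`), the patterns, and the `stub_llm` function below are all hypothetical. The sketch also assumes that "selectively using LLM results" means falling back to the LLM only when RegEx finds no value; routing whole fields to the LLM based on measured RegEx performance would be an equally plausible reading.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical patterns for two illustrative CMR metrics (assumption: the
# abstract does not publish the actual rules or field names).
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "rv_edv_ml": re.compile(r"RV\s*EDV\D{0,20}?(\d+(?:\.\d+)?)\s*ml", re.IGNORECASE),
    "rv_ef_pct": re.compile(r"RV\s*EF\D{0,20}?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE),
}

# Signature of the LLM step: (report text, field name) -> value or None.
LlmExtractor = Callable[[str, str], Optional[float]]


def regex_extract(report: str, field: str) -> Optional[float]:
    """Return the first numeric match for `field`, or None if RegEx fails."""
    match = PATTERNS[field].search(report)
    return float(match.group(1)) if match else None


def hybrid_extract(report: str, llm_extract: LlmExtractor) -> Dict[str, Optional[float]]:
    """Try RegEx first; call the LLM only for fields RegEx could not fill."""
    results: Dict[str, Optional[float]] = {}
    for field in PATTERNS:
        value = regex_extract(report, field)
        if value is None:
            value = llm_extract(report, field)
        results[field] = value
    return results


def coverage(results: Dict[str, Optional[float]]) -> float:
    """Fraction of fields with a non-missing value."""
    return sum(v is not None for v in results.values()) / len(results)


def stub_llm(report: str, field: str) -> Optional[float]:
    """Stand-in for an on-premises few-shot LLM call; returns canned answers."""
    canned = {"rv_ef_pct": 45.0}
    return canned.get(field)


# Made-up report text: the second metric is phrased so the RegEx rule misses it.
report = "RV EDV: 210 mL. Right ventricular ejection fraction is 45 percent."
regex_only = {field: regex_extract(report, field) for field in PATTERNS}
hybrid = hybrid_extract(report, stub_llm)
print(coverage(regex_only), coverage(hybrid))
```

Because the LLM is invoked only for the fields RegEx leaves empty, a design like this would call the model far less often than an LLM-only pipeline, which is in line with the runtime reduction the abstract reports.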

Institutions

Cleveland Clinic (US)

Topics

Machine Learning in Healthcare; Artificial Intelligence in Healthcare and Education; Cardiovascular Health and Risk Factors
