OpenAlex · Updated hourly · Last updated: 13.04.2026, 05:19

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Electronic health record phenotypes to classify sickle cell anemia versus other subtypes of sickle cell disease in the ASH RC data hub

2025 · 0 citations · Blood · Open Access

Citations: 0 · Authors: 6 · Year: 2025

Abstract

Introduction: Sickle cell disease (SCD) comprises a group of severe hemoglobinopathies caused by mutations in the β-globin gene (HBB). It includes sickle cell anemia (homozygous SCD, HbSS) and compound heterozygous conditions such as hemoglobin SC disease (HbSC), sickle-βnull thalassemia (HbSβ0), and sickle-βplus thalassemia (HbSβ+). Over 100,000 people in the U.S. have SCD, and worldwide over 300,000 infants with SCD are born each year. Automated methods to determine SCD subtypes are needed to advance population-based research and support learning healthcare systems. Our goal was to use data from electronic health records (EHR) to accurately classify subtypes of SCD.
Methods: We used real-world longitudinal EHR data (1/1/2015-12/31/2023), including patient characteristics, pharmacotherapy, conditions, acute care, SCD and sickle trait codes, and labs, from 12 centers in the ASH Research Collaborative (ASH RC) Data Hub. We consolidated labs with similar LOINC names and harmonized units across sites. We imputed missing data, aggregated data by year, and used the median value when a laboratory test was repeated within a year. We modeled the SCD subtype (HbSS/HbSβ0 vs. HbSC/HbSβ+) using logistic regression (R 4.5.0), linear mixed-effects models with a random intercept for site, and random forest methods, after splitting the data set into training (80%) and test (20%) sets. We repeated all models in the subcohort with hemoglobin electrophoresis or hemoglobin high-performance liquid chromatography (HPLC) results. We calculated the accuracy, Cohen's kappa, sensitivity, specificity, and positive and negative predictive values for classification of HbSS/HbSβ0 vs. HbSC/HbSβ+ for each model overall, by age group (<18 years, 18 to 39.99 years, and >40 years), and by site.
Results: We included 9,086 people with an ASH RC principal investigator-confirmed diagnosis of SCD, of whom 6,281 (69.1%) had HbSS, 196 (2.2%) HbSβ0, 2,046 (22.5%) HbSC, and 563 (6.2%) HbSβ+.
Mean age was 19.1 years (SD 17.1), and 52.6% were female. The linear mixed-effects model had an accuracy of 0.859 (95% CI 0.852-0.865), a kappa of 0.636 (good agreement), a sensitivity of 0.59, and a specificity of 0.92. Model accuracy varied by age group, with higher accuracy for people <18 years (0.882, 95% CI 0.873-0.890) than for those 18 to 39.99 years (0.854, 95% CI 0.843-0.865) and >40 years (0.791, 95% CI 0.771-0.810). Accuracy also varied by site, ranging from 0.632 (95% CI 0.597-0.666) to 0.965 (95% CI 0.939-0.982). Hemoglobin electrophoresis or HPLC results were available for 37.2% of participants (N=3,378) across 7 of the 12 sites, and incorporating these tests improved overall accuracy to 0.901 (95% CI 0.888-0.913), with a kappa of 0.63. The random forest models performed similarly to the linear mixed-effects models. Missing data for commonly obtained laboratory tests were more frequent than expected: 9.6% of participants had no results for red cell distribution width (RDW), 8.5% for mean corpuscular volume (MCV), and 23.2% for reticulocyte count.
Conclusions: We developed an automated approach to subclassify SCD into groups of similar genotypes (HbSS/HbSβ0 vs. HbSC/HbSβ+) using discrete EHR data elements, including clinical features and laboratory values, from multiple sites participating in the ASH RC Data Hub. Challenges included standardization of laboratory test results across sites, identification of steady-state laboratory results, and limited availability of results that quantified variant hemoglobins. Despite these limitations, the accuracy of this approach was moderate to high. However, some unexpected misclassification occurred, even with the addition of hemoglobin electrophoresis and HPLC results.
The significant variation in the accuracy of the linear mixed-effects model across sites may stem from variation in the proportion of missing data, the approach used to determine the investigator-confirmed diagnosis of SCD, the age distribution of people living with SCD, or different rates of transfusion. The lower accuracy in older people was expected, as more adults are transfused, which modifies the laboratory values with the greatest discriminatory power (MCV, RDW, reticulocyte count, and variant hemoglobin %). Automated EHR approaches could be extremely useful for advancing SCD research, but misclassification must first be reduced through better standardization and completeness of data, with broader inclusion of results from variant hemoglobin quantification.
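
The performance measures named in the Methods (accuracy, Cohen's kappa, sensitivity, specificity, and positive and negative predictive value) can all be derived from a 2×2 confusion matrix. The following is a minimal Python sketch of those standard formulas, not the authors' code (the abstract reports that the analysis was done in R). The `ConfusionMatrix` class, the `classification_metrics` function, the example counts, and the choice of which genotype group counts as the positive class are all illustrative assumptions; the abstract does not report the underlying counts or state the positive class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier (hypothetical helper, not from the paper)."""
    tp: int  # positive class predicted as positive
    fp: int  # negative class predicted as positive
    fn: int  # positive class predicted as negative
    tn: int  # negative class predicted as negative


def classification_metrics(cm: ConfusionMatrix) -> dict[str, float]:
    """Compute the metrics listed in the abstract from a 2x2 confusion matrix.

    Assumes every row and column total is non-zero; no guards are included.
    """
    n = cm.tp + cm.fp + cm.fn + cm.tn
    accuracy = (cm.tp + cm.tn) / n
    sensitivity = cm.tp / (cm.tp + cm.fn)
    specificity = cm.tn / (cm.tn + cm.fp)
    ppv = cm.tp / (cm.tp + cm.fp)
    npv = cm.tn / (cm.tn + cm.fn)

    # Cohen's kappa: observed agreement corrected for the agreement expected
    # by chance, given the marginal totals of predictions and true labels.
    chance_pos = ((cm.tp + cm.fp) / n) * ((cm.tp + cm.fn) / n)
    chance_neg = ((cm.tn + cm.fn) / n) * ((cm.tn + cm.fp) / n)
    expected = chance_pos + chance_neg
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; they are NOT the study's results.
    example = ConfusionMatrix(tp=80, fp=10, fn=20, tn=90)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Kappa is a useful companion to accuracy here because the two groups are imbalanced (about 71% of the cohort is HbSS/HbSβ0), so a classifier can reach a fairly high accuracy while agreeing with the confirmed diagnosis only modestly beyond chance.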

Topics

Hemoglobinopathies and Related Disorders · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education