This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
AI-based, secure and privacy-preserving synthetic data generation platform in transfusion-dependent β-thalassemia applied to the Webthal® dataset
0
Citations
29
Authors
2025
Year
Abstract
Background: Although AI technologies for synthetic data (SD) generation are well developed, their direct application in clinical settings remains challenging. Key issues include the privacy and security concerns arising from the use of closed, third-party models with sensitive patient data. Furthermore, many generative models lack adequate clinical validation and often fail to capture and replicate the complex correlations between clinical variables. These barriers are particularly acute in rare diseases like β-thalassemia, where data are already scarce. Our work addresses these specific challenges by implementing a secure, transparent platform to generate a high-fidelity synthetic cohort of patients with transfusion-dependent β-thalassemia (TDT).
Aims: This work aims 1) to implement an AI-based SD generation platform within the locally secured, privacy-preserving environment of the Webthal® dataset; and 2) to demonstrate its secure integration and its efficacy in generating high-quality synthetic clinical data suitable for clinical research, the creation of digital twins, and synthetic control arms for clinical trials.
Methods: We implemented the TRAIN SD generation platform (www.train-ai.eu) within the Webthal® environment. This platform integrates various generative models designed for multimodal data. A selected CT-WGAN was trained on a retrospective cohort of 779 adult (≥18 years) TDT patients. Real-world data were collected from Italian centers between 2010 and 2019 using the Webthal® computerized medical record. To assess the quality of the SD, SAFE (Synthetic vAlidation FramEwork) was implemented within the platform to evaluate statistical fidelity, clinical utility, and privacy preservation. SAFE computes several statistical metrics, aggregated into the Clinical Synthetic Fidelity (CSF) score for statistical fidelity, and the nearest neighbor distance ratio (NNDR) for privacy preservation.
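The abstract does not define how SAFE computes NNDR. As a minimal sketch, assuming the commonly used definition (for each synthetic record, the distance to its nearest real record divided by the distance to its second-nearest real record) and assuming numeric, already-scaled features, the metric and an exact-match check could look as follows. The function names `nndr_per_record`, `mean_nndr`, and `count_exact_matches`, and the toy records, are hypothetical and not taken from the platform.

```python
import math
from typing import List, Sequence

Record = Sequence[float]


def nndr_per_record(synthetic: List[Record], real: List[Record]) -> List[float]:
    """Nearest neighbor distance ratio of each synthetic record against the real data.

    Assumed definition: d1 / d2, where d1 and d2 are the Euclidean distances to the
    nearest and second-nearest real records. Values close to 1 mean the synthetic
    record is not unusually close to any single real record; values close to 0 mean
    it sits almost on top of one real record.
    """
    if len(real) < 2:
        raise ValueError("need at least two real records")
    ratios = []
    for s in synthetic:
        dists = sorted(math.dist(s, r) for r in real)
        d1, d2 = dists[0], dists[1]
        # If d2 is zero, the record coincides with duplicated real records.
        ratios.append(0.0 if d2 == 0 else d1 / d2)
    return ratios


def mean_nndr(synthetic: List[Record], real: List[Record]) -> float:
    """Average NNDR over all synthetic records."""
    ratios = nndr_per_record(synthetic, real)
    return sum(ratios) / len(ratios)


def count_exact_matches(synthetic: List[Record], real: List[Record]) -> int:
    """Number of synthetic records identical to some real record."""
    real_set = {tuple(r) for r in real}
    return sum(1 for s in synthetic if tuple(s) in real_set)


# Toy, invented data (not from the Webthal® dataset).
real_toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic_toy = [(0.0, 0.0), (0.5, 0.5)]
print(nndr_per_record(synthetic_toy, real_toy))
print(count_exact_matches(synthetic_toy, real_toy))
```

In this toy case the first synthetic record copies a real one and the second lies equidistant from all real records, which illustrates the two extremes of the ratio.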
For clinical validation, we used the synthetic TDT cohort to replicate the findings of Musallam et al. (PMID:37976447) on the association between pre-transfusion hemoglobin (Hb) levels and mortality, comparing variable distributions and clinical conclusions from the SD analysis with those derived from the real cohort. All analyses were conducted in three settings: 1) creating a 1:1 privacy-preserving proxy of the original Webthal® dataset; 2) augmenting the cohort to twice its size for simulation purposes; and 3) conditionally generating a dataset with specific patient characteristics (deaths by Hb category) to show its flexibility for clinical research.
Results: Patient distributions, stratified by pre-transfusion Hb levels, were consistent with the original cohort. The 5- and 10-year survival rates, as well as the unadjusted and adjusted hazard ratios for mortality, were comparable between the real and synthetic data. No identical matches to a real patient were found in any of the generated datasets. The replicated dataset (n=779) demonstrated high fidelity (CSF=0.91, NNDR=0.84). The Kaplan-Meier log-rank test for survival stratified by Hb category was highly comparable between the original (chisq=11.6, p=0.02) and synthetic (chisq=12.9, p=0.01) datasets. The augmented dataset (n=1558) maintained high fidelity (CSF=0.90, NNDR=0.82). The increased statistical power was evident in the log-rank test, which yielded a more significant result (chisq=46.1, p<0.001) than the original (p=0.02), demonstrating the utility of augmentation for strengthening statistical signals. The conditionally generated cohort (n=779) showed excellent fidelity (CSF=0.90, NNDR=0.81) and remarkable clinical mimicry. It replicated the statistical outcome of the log-rank test on the original data (synthetic: chisq=11.2, p=0.02).
When stratifying by ferritin levels, the association between mortality and Hb that was found in the ≤1000 ng/mL category of the original data was found in the >1000 ng/mL category in the replicated datasets, and in both categories in the larger CD cohort.
Conclusions: The study demonstrates that a robust, clinically oriented platform can generate high-quality, useful SD even in the complex setting of a rare disease such as β-thalassemia, overcoming installation and validation limitations. SD replicated key clinical outcomes and statistical properties, enabled effective data augmentation, and allowed for flexible cohort simulation. This validated technology therefore represents a powerful tool to overcome data sharing barriers and accelerate precision medicine research in hematology.
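The Results report log-rank chi-square statistics for survival stratified by Hb category. The abstract does not describe the software or exact procedure used, so the following is only a simplified sketch of the standard log-rank test restricted to two groups (the study compares several Hb categories, which requires the multi-group form). The function name `logrank_two_groups` and the toy follow-up times are hypothetical.

```python
import math
from typing import List, Tuple


def logrank_two_groups(
    times_a: List[float], events_a: List[int],
    times_b: List[float], events_b: List[int],
) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square, p-value) with 1 degree of freedom.

    times_*  : follow-up time of each patient
    events_* : 1 if the patient died at that time, 0 if censored
    """
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})

    observed_a = 0.0  # deaths observed in group A
    expected_a = 0.0  # deaths expected in group A under the null hypothesis
    variance = 0.0    # hypergeometric variance accumulated over event times

    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in A just before t
        n_b = sum(1 for x in times_b if x >= t)  # at risk in B just before t
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy, invented follow-up data (years); every patient has an event.
group_low = [1.0, 2.0, 3.0, 4.0, 5.0]
group_high = [6.0, 7.0, 8.0, 9.0, 10.0]
chi2, p = logrank_two_groups(group_low, [1] * 5, group_high, [1] * 5)
print(f"chisq={chi2:.2f}, p={p:.4f}")
```

Running the same function on a real cohort and on its synthetic counterpart, and comparing the two statistics, mirrors the kind of comparison the abstract reports, under the two-group simplification stated above.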
Similar works
Automatic Recording Apparatus for Use in Chromatography of Amino Acids
1958 · 9,602 citations
Enzymatic Amplification of β-Globin Genomic Sequences and Restriction Site Analysis for Diagnosis of Sickle Cell Anemia
1985 · 8,995 citations
Estimation of total, protein-bound, and nonprotein sulfhydryl groups in tissue with Ellman's reagent
1968 · 7,950 citations
Hepcidin Regulates Cellular Iron Efflux by Binding to Ferroportin and Inducing Its Internalization
2004 · 4,720 citations
A novel MHC class I–like gene is mutated in patients with hereditary haemochromatosis
1996 · 3,705 citations
Authors
- Mattia Delleani
- Barbara Gianesin
- Saverio D’Amico
- Alessandro Bruseghini
- Susanna Barella
- Raffaella Origa
- Giovanni Battista Ferrero
- Roberto Lisi
- A Pasanisi
- Filomena Longo
- Valeria Maria Pinto
- Gianluca Asti
- Elisabetta Sauta
- Eleonora Iascone
- Luca Lanino
- Alessia Campagna
- Giulia Maggioni
- Marta Ubezio
- Antonio Russo
- Gabriele Todisco
- Cristina Astrid Tentori
- Alessandro Buizza
- Ivan Ferrari
- Matteo Zampini
- Marilena Bicchieri
- Victor Savevski
- Khaled M. Musallam
- Matteo Della Porta
- Gian Luca Forni
Institutions
- IRCCS Humanitas Research Hospital(IT)
- Associazione Italiana Sclerosi Multipla(IT)
- LivaNova (Italy)(IT)
- Ospedale Microcitemico(IT)
- University of Cagliari(IT)
- Ospedale San Luigi Gonzaga(IT)
- Ospedale Garibaldi(IT)
- Ospedale A. Perrino(IT)
- Arcispedale Sant'Anna(IT)
- Ospedale Policlinico San Martino(IT)
- University of New Haven(US)
- Humanitas University(IT)
- Khalifa University of Science and Technology(AE)
- Cornell University(US)
- Center for Cancers and Blood Disorders(US)
- Istituto Giannina Gaslini(IT)