This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
CardioSynth: privacy-preserving generation and evaluation of multimodal synthetic data for cardiovascular research
Citations: 0
Authors: 8
Year: 2026
Abstract
Background: Secondary use of cardiology data is vital for trend analysis and data-driven healthcare innovation, but privacy concerns and regulations limit access. Synthetic data (SD) offers a solution, yet cardiology lacks a standardized, privacy-preserving framework for generating high-quality multimodal datasets.

Purpose: This study presents a framework for generating multimodal, differentially private (DP) synthetic data to enable collaborative research under diverse privacy constraints. The data will form part of CardioSynth, a European dataset for privacy-preserving AI in cardiology.

Methods: A secure SD generation framework for cardiology was deployed in eight European sites using real-world multimodal data (structured data, images, and unstructured text) from both local hospital systems and the publicly available MIMIC-IV. For structured data, we evaluated Probabilistic Graphical Models and Deep Generative Models; the latter were also applied to imaging data. Both model families were trained using publicly available data and clinical data from participating hospitals. Centralized and federated configurations were tested. Fidelity, utility, and formal privacy were assessed under inferential privacy budgets ε ∈ {0.1, 1, 5, 10, 40}, enabling private cardiovascular risk modeling. For unstructured data, the DP in-context learning (DP-ICL) approach was adopted. Private datasets were partitioned into multiple subsets, from which ICL prompts for an LLM were generated; the LLM outputs were then aggregated using Embedding Space Aggregation to preserve both privacy and semantic meaning. Model fine-tuning was not required. The entire pipeline is available as a web app. Generated notes were evaluated by 19 experts, who were presented with a subsample of 5 generated samples. To assess utility, the evaluation focused on data coherence, usability, language, and style; reviewers rated the texts on each of these features on a scale from 1 to 5.
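The partition, prompt, and aggregate steps described for unstructured data can be illustrated with a minimal sketch. This is not the authors' pipeline: `fake_llm` is a hypothetical stand-in for the LLM call, the bag-of-words embedder replaces a real sentence encoder, the candidate texts are assumed to come from a public (non-private) source, and the notes are invented toy strings. Mapping the noise multiplier `sigma` to a concrete (ε, δ) guarantee is deliberately left out.

```python
import math
import random

def partition(records, n_subsets):
    """Split private records into disjoint subsets, one ICL prompt per subset."""
    return [records[i::n_subsets] for i in range(n_subsets)]

def make_bow_embedder(public_texts):
    """Toy embedder (assumption): unit-norm bag of words over a public vocabulary."""
    vocab = sorted({tok for text in public_texts for tok in text.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}

    def embed(text):
        vec = [0.0] * len(index)
        for tok in text.lower().split():
            if tok in index:
                vec[index[tok]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm > 0 else vec

    return embed

def noisy_mean_embedding(embeddings, sigma, rng):
    """Average the per-subset embeddings and add Gaussian noise.

    Every embedding has L2 norm <= 1, so replacing one subset's output
    moves the mean by at most 2/n; the noise is scaled to that bound.
    """
    n = len(embeddings)
    scale = sigma * 2.0 / n
    return [
        sum(e[j] for e in embeddings) / n + rng.gauss(0.0, scale)
        for j in range(len(embeddings[0]))
    ]

def select_candidate(noisy_mean, candidates, embed):
    """Pick the candidate most aligned with the noisy mean (dot product;
    candidates are unit-norm, so the ranking matches cosine similarity)."""
    def score(text):
        return sum(a * b for a, b in zip(embed(text), noisy_mean))
    return max(candidates, key=score)

def fake_llm(examples):
    """Hypothetical stand-in for an LLM prompted with ICL examples."""
    return " ".join(examples)

def dp_icl_esa(private_records, candidates, n_subsets, sigma, seed=0):
    """Partition, query once per subset, aggregate in embedding space."""
    rng = random.Random(seed)
    embed = make_bow_embedder(candidates)
    outputs = [fake_llm(s) for s in partition(private_records, n_subsets)]
    noisy = noisy_mean_embedding([embed(o) for o in outputs], sigma, rng)
    return select_candidate(noisy, candidates, embed)

# Invented toy data, for illustration only.
TOY_NOTES = [
    "chest pain and elevated troponin",
    "elevated troponin with chest pain on exertion",
    "chest pain radiating to left arm",
    "troponin elevated following chest pain",
]
CANDIDATES = [
    "chest pain with elevated troponin",
    "knee fracture after fall",
]
print(dp_icl_esa(TOY_NOTES, CANDIDATES, n_subsets=2, sigma=0.5))
```

Only the noisy mean embedding touches private data; the released text is always one of the public candidates, which is why no fine-tuning on private records is needed in this kind of scheme.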
Results: PrivBayes preserved the marginals of structured SD (<5%) but had limited fidelity (AUROC = 0.71). DP-GAN improved structure (correlation error < 0.06; AUROC = 0.82; MIA ≤ 1.2%). Federated DP-GAN scaled to 9 hospitals, converged (10–27 rounds), and retained utility (AUROC = 0.79 at ε = 2). As expected, higher ε resulted in better fidelity but weaker privacy. DP-GAN enabled imaging synthesis under ε ≤ 5. Unstructured outputs were successfully generated using Llama3.1. The clinical evaluation showed good coherence and generally actionable generations. Average scores for the main text features, across all samples and reviewers, ranged from 3.2 to 3.92. Reviewers found the generated texts realistic.

Conclusions: A framework was established to generate multimodal cardiology SD from clinically available sources. This work lays the groundwork for creating CardioSynth, a public, comprehensive SD repository that supports collaborative research while accommodating diverse privacy requirements.

CardioSynth: Cardiology Synth Data
CardioSynth: Synthetic Data Evaluation
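The trade-off between ε and fidelity reported in the Results can be made concrete with a small sketch. The abstract does not say which noise mechanism each model uses, so the Laplace mechanism on a counting query (sensitivity 1) is used here purely as a textbook illustration; the helper names are hypothetical. The noise scale is sensitivity / ε, so a larger budget means less noise and a weaker privacy guarantee.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count under epsilon-DP using the Laplace mechanism."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(epsilon, trials=2000, seed=0, sensitivity=1.0):
    """Empirical mean absolute error of the mechanism at a given epsilon."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

# Privacy budgets taken from the Methods section of the abstract.
BUDGETS = [0.1, 1, 5, 10, 40]
for eps in BUDGETS:
    print(f"epsilon={eps:>4}: noise scale={1.0 / eps:.3f}, "
          f"empirical error={mean_abs_error(eps):.3f}")
```

The expected absolute error of Laplace noise equals its scale, so the error column should shrink roughly in proportion to 1/ε across the listed budgets.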