This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Generation of Synthetic Data in Health Surveys Using Large Language Models

2026 · 0 citations · medRxiv · Open Access

Abstract

Background: Generating synthetic data using artificial intelligence, such as large language models (LLMs), is a useful strategy in public health because it can reduce time and costs, expand access to data, and facilitate information sharing without compromising confidentiality.

Objective: To evaluate the consistency and psychometric plausibility of synthetic data generated by an LLM to simulate the responses of survey participants (user personas) in a national health survey in Peru.

Methods: We conducted a cross-sectional study based on the National Health Satisfaction Survey (ENSUSALUD 2016) of ambulatory health service users. We used the GPT-OSS-20B model to generate synthetic responses in Spanish, conditioned on narrative profiles derived from sociodemographic and clinical variables. We evaluated consistency between responses and profile characteristics (sex, age, and comorbidities) using performance metrics (accuracy, precision, recall, F1 score, and AUC). We compared distributions between real and synthetic data using t-tests and chi-square tests. For latent variables, we conducted confirmatory factor analyses of the PHQ-9, PHQ-8, and GAD-7 (WLSMV; polychoric matrices) and estimated internal consistency (α and ω). We examined normality (Jarque–Bera test) and stability through correlations between real measures (PHQ-2 and EQ-5D) and synthetic measures (PHQ-2, PHQ-8, PHQ-9, GAD-2, and GAD-7).

Results: The model showed strong concordance with the profile for sex, age, and chronic disease status, with metrics close to 1 for most variables; overall consistency was high in the vast majority of cases. The synthetic PHQ-9, PHQ-8, and GAD-7 instruments showed optimal factor fit and high internal consistency. Synthetic measures were positively and significantly correlated with the real PHQ-2 and negatively correlated with EQ-5D, with moderate to high correlations, particularly for PHQ-8/PHQ-9 and GAD-7.

Conclusions: An LLM can generate plausible synthetic data for health surveys when its output is conditioned on user personas, preserving high coherence with demographic and clinical characteristics and maintaining adequate psychometric properties in depression and anxiety scales. However, relevant deviations were identified (e.g., overestimation of obesity, unexpected distributions in some variables, and missing values in a sensitive item), which supports the need for rigorous validation and bias control before using these data for inferential purposes or public policy.
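
Two of the quantities named in the Methods, the profile-consistency metrics (accuracy, precision, recall, F1) and Cronbach's α, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `binary_metrics` and `cronbach_alpha` and all sample values are hypothetical and were chosen only for illustration. The α shown here is the classical variance-based formula on raw item scores, which is a simplification of the polychoric-based estimates the abstract mentions.

```python
from statistics import variance


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive).

    y_true: attribute stated in the persona profile (e.g. has diabetes).
    y_pred: attribute implied by the synthetic response.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def cronbach_alpha(rows):
    """Classical Cronbach's alpha.

    rows: one list of k item scores per respondent (k >= 2).
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Made-up labels, NOT taken from ENSUSALUD 2016.
    profile = [1, 0, 1, 1, 0]
    response = [1, 0, 0, 1, 1]
    print(binary_metrics(profile, response))

    # Made-up scores for three respondents on a two-item scale.
    print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```

In practice a library such as scikit-learn would usually supply the classification metrics, and ordinal items like those of the PHQ-9 or GAD-7 are better served by the polychoric approach described in the abstract; the sketch only makes the definitions concrete.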

Topics

Artificial Intelligence in Healthcare and Education; Machine Learning in Healthcare; Mental Health via Writing
