Enhancing Treatment Effect Heterogeneity in Behavioural Public Policy Interventions through Large Language Model–Generated Synthetic Data
We propose a framework that leverages LLMs to create diverse, synthetic participant profiles and trial outcomes based on real-world distributions and established behavioural theories. Using the concept of "homo silicus," our approach simulates larger and more varied study populations, potentially uncovering rare subgroups or interaction effects that conventional RCTs might miss due to sample size or homogeneity constraints. The framework ensures data realism through validation against existing datasets and adherence to behavioural plausibility.
Our methodology integrates real and synthetic participants in a multi-stage adaptive design, with the synthetic profiles generated by an LLM. We employ covariate-adjusted randomization and adaptive propensity scoring refined by LLM insights, alongside sequential randomization with synthetic validation. This integration facilitates the exploration of edge cases, covariate balancing, and adaptive design strategies tailored to behavioural interventions. Advanced analytical techniques, including causal forests, T-learners, and Treatment-Agnostic Representation Networks (TARNet), are used to estimate heterogeneous treatment effects with increased precision.
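To make the T-learner idea concrete, the following is a minimal sketch, not the pipeline used in this work. It assumes a single covariate, one-variable least-squares lines as the two outcome models, and toy data simulated in the script itself (no LLM-generated profiles are involved). The names `fit_line`, `t_learner`, and `simulate` are hypothetical and introduced only for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def t_learner(xs, ts, ys):
    """Fit one outcome model per arm; return a function x -> estimated effect."""
    # Outcome model for the treated arm (t == 1).
    a1, b1 = fit_line([x for x, t in zip(xs, ts) if t == 1],
                      [y for y, t in zip(ys, ts) if t == 1])
    # Outcome model for the control arm (t == 0).
    a0, b0 = fit_line([x for x, t in zip(xs, ts) if t == 0],
                      [y for y, t in zip(ys, ts) if t == 0])
    # Conditional effect estimate: difference of the two predictions.
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)


def simulate(n, rng):
    """Toy data (assumed): the true effect is 2*x, so it varies with x."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ts = [rng.randint(0, 1) for _ in range(n)]  # simple 50/50 randomization
    ys = [1.0 + 0.5 * x + t * 2.0 * x + rng.gauss(0, 0.1)
          for x, t in zip(xs, ts)]
    return xs, ts, ys


rng = random.Random(0)
xs, ts, ys = simulate(2000, rng)
cate = t_learner(xs, ts, ys)
print(round(cate(0.2), 2), round(cate(0.8), 2))
```

In this toy setting the estimated effect should be larger at `x = 0.8` than at `x = 0.2`, which is the kind of heterogeneity a T-learner is meant to surface. Causal forests and TARNet replace the two linear fits with more flexible learners but target the same conditional effect.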
Our simulations indicate that incorporating synthetic data identifies meaningful subgroups with differential treatment responses that are not apparent in the original datasets alone.