2604.23904v1 Apr 26, 2026 stat.ME

Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities

Abstract

Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic-test-on-real performance while substantially distorting causal estimands such as the average treatment effect (ATE). We formalize this failure through sensitivity and tradeoff results showing that ATE preservation requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression. Motivated by this observation, we propose a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms, using distance-to-closest-record diagnostics to monitor covariate synthesis and separately learned nuisance models to construct (W, A, Y) triplets. We further study targeted synthetic augmentation for practical positivity problems and characterize when added overlap support helps by improving conditional-effect estimation more than it shifts the covariate distribution. Finally, we develop a synthetic simulation engine for pre-analysis estimator evaluation, enabling finite-sample comparison of OR, IPW, AIPW, and TMLE under realistic covariate structure. Across experiments, hybrid synthetic data substantially improve ATE preservation relative to fully generative baselines and provide a practical diagnostic tool for robust causal analysis.
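To make the hybrid construction concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy data-generating process with a true ATE of 2, a linear outcome regression, and a logistic propensity model. The covariate synthesizer is a deliberately simple stand-in (bootstrap resampling plus Gaussian jitter) where the paper would use a generative tabular model. All function and variable names are hypothetical, and TMLE is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def design(W, A=None):
    """Design matrix [1, W] or, if A is given, [1, W, A]."""
    cols = [np.ones(len(W)), W]
    if A is not None:
        cols.append(A)
    return np.column_stack(cols)

def fit_logistic(X, a, iters=25):
    """Logistic regression via Newton's method (X already has an intercept)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, X.T @ (a - p))
    return beta

def fit_nuisance(W, A, Y):
    """Fit the propensity model g(W) and outcome regression Q(A, W) separately."""
    g_coef = fit_logistic(design(W), A)
    q_coef = np.linalg.lstsq(design(W, A), Y, rcond=None)[0]
    return g_coef, q_coef

def estimate_ate(W, A, Y):
    """Return OR (plug-in), IPW, and AIPW estimates of the ATE."""
    g_coef, q_coef = fit_nuisance(W, A, Y)
    g = expit(design(W) @ g_coef)
    Q1 = design(W, np.ones(len(W))) @ q_coef
    Q0 = design(W, np.zeros(len(W))) @ q_coef
    return {
        "OR": np.mean(Q1 - Q0),
        "IPW": np.mean(A * Y / g - (1 - A) * Y / (1 - g)),
        "AIPW": np.mean(Q1 - Q0 + A * (Y - Q1) / g - (1 - A) * (Y - Q0) / (1 - g)),
    }

# Toy "real" data (assumed DGP, true ATE = 2).
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.5 * W[:, 0] - 0.5 * W[:, 1]))
Y = 1.0 + 2.0 * A + W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)

# Step 1: synthesize covariates only (stand-in generator, not a GAN or LLM).
W_syn = W[rng.integers(0, n, size=n)] + 0.1 * rng.normal(size=(n, 2))

# Step 2: distance-to-closest-record (DCR) of each synthetic row to the real rows.
d2 = (W_syn**2).sum(1)[:, None] + (W**2).sum(1)[None, :] - 2.0 * W_syn @ W.T
dcr = np.sqrt(np.clip(d2, 0.0, None)).min(axis=1)

# Step 3: build (W, A, Y) triplets from separately learned nuisance models.
g_coef, q_coef = fit_nuisance(W, A, Y)
resid = Y - design(W, A) @ q_coef
A_syn = rng.binomial(1, expit(design(W_syn) @ g_coef))
Y_syn = design(W_syn, A_syn) @ q_coef + rng.choice(resid, size=n)

est_real = estimate_ate(W, A, Y)
est_syn = estimate_ate(W_syn, A_syn, Y_syn)
print("real:", est_real)
print("synthetic:", est_syn)
print("median DCR:", np.median(dcr))
```

Because treatment and outcome are drawn from the fitted nuisance models rather than from a joint generator, the treatment-effect contrast in the synthetic data is inherited from the fitted outcome regression, and the DCR summary gives a rough check that the synthetic covariates neither copy real records exactly nor drift far from them.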
