2604.16817v1 Apr 18, 2026 cs.LG

Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration

Julian Rodemann

Krikamol Muandet

CISPA Helmholtz Center for Information Security

Qilong Li

E. Arias

Christian Heumann

Chongsheng Zhang

Hao Wang

Zelong Yu

Zhanshuo Zhang

Gaojuan Fan

Original Abstract

Imbalanced data is common in real-world applications. Data synthesis can effectively mitigate the scarcity of rare-class samples, and LLMs have revolutionized text generation, yet the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs to continuously optimize the quality of the generated data throughout the synthesis process. In this work, we propose RDDG (Relational Data generator with Dynamic Guidance), a unified in-context learning framework that employs progressive chain-of-thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core set selection to identify representative samples from the original data, then uses in-context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data that preserves these patterns and correlations. More importantly, it incorporates a self-reinforcing feedback mechanism that automatically assesses the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.
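
The pipeline the abstract describes (core set selection, generation, quality feedback) can be illustrated with a toy sketch. This is not the authors' implementation, which lives in the linked repository. The abstract does not name the core set algorithm, the quality score, or the prompts, so everything here is an assumption: greedy k-center selection stands in for core set selection, a hypothetical `propose` function that jitters a core-set row stands in for the LLM call, and a hypothetical distance-based `quality` score stands in for the automatic assessment.

```python
import math
import random


def k_center_greedy(rows, k):
    """Pick k representative rows by greedy k-center (farthest-point) selection.

    Assumption: the abstract only says "core set selection"; this is one
    common way to do it, not necessarily the one RDDG uses.
    """
    if not 0 < k <= len(rows):
        raise ValueError("k must be between 1 and len(rows)")
    chosen = [0]  # start from the first row
    # Distance from every row to its nearest chosen center so far.
    nearest = [math.dist(r, rows[0]) for r in rows]
    while len(chosen) < k:
        idx = max(range(len(rows)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(r, rows[idx])) for d, r in zip(nearest, rows)]
    return [rows[i] for i in chosen]


def propose(core, noise, rng):
    """Hypothetical stand-in for the LLM call: jitter a random core-set row."""
    base = rng.choice(core)
    return tuple(x + rng.gauss(0.0, noise) for x in base)


def quality(candidate, core):
    """Hypothetical quality score: negative distance to the nearest core row."""
    return -min(math.dist(candidate, c) for c in core)


def synthesize(rows, k, n_out, threshold, seed=0, max_rounds=10_000):
    """Generate up to n_out rows, using rejections as feedback to the generator."""
    rng = random.Random(seed)
    core = k_center_greedy(rows, k)
    noise, out = 1.0, []
    for _ in range(max_rounds):
        if len(out) == n_out:
            break
        cand = propose(core, noise, rng)
        if quality(cand, core) >= -threshold:
            out.append(cand)
        else:
            noise *= 0.9  # feedback: a rejected sample tightens later proposals
    return core, out


# Toy minority-class rows with two numeric attributes (made-up values).
rows = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
core, out = synthesize(rows, k=2, n_out=5, threshold=0.5)
```

In a real setting the `propose` step would be a prompt built from the core set, and the feedback would adjust that prompt rather than a noise scale; the loop structure is the only part this sketch tries to convey.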
