2602.22743v1 Feb 26, 2026 cs.AI

Generative Data Transformation: From Mixed to Unified Data

Enhong Chen (citations: 1,961; h-index: 21)
Jiaqing Zhang (citations: 52; h-index: 3)
Mingjia Yin (citations: 216; h-index: 7)
Hao Wang (citations: 838; h-index: 13)
Yuxin Tian (citations: 72; h-index: 4)
Yawen Li (citations: 33; h-index: 3)
Wei Guo (citations: 344; h-index: 10)
Yong Liu (citations: 50; h-index: 4)
Yuyang Ye (citations: 172; h-index: 6)

Original Abstract

Recommendation model performance is intrinsically tied to the quality, volume, and relevance of its training data. To address common challenges such as data sparsity and cold start, recent research has leveraged data from multiple auxiliary domains to enrich information within the target domain. However, inherent domain gaps can degrade the quality of mixed-domain data, leading to negative transfer and diminished model performance. The prevailing model-centric paradigm, which relies on complex, customized architectures, struggles to capture the subtle, non-structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose Taesar, a data-centric framework for target-aligned sequential regeneration. Taesar employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences, enabling standard models to learn intricate dependencies without complex fusion architectures. Experiments show that Taesar outperforms model-centric solutions and generalizes to various sequential models. By generating enriched datasets, Taesar effectively combines the strengths of the data-centric and model-centric paradigms. The code accompanying this paper is available at https://github.com/USTC-StarTeam/Taesar.
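
The abstract names contrastive decoding but does not spell out the scoring rule, so the following is only a minimal sketch of the generic technique, not Taesar's actual procedure. It assumes two next-item distributions over the same candidates: a hypothetical `p_context` from a model that sees cross-domain context and a hypothetical `p_baseline` from a model that does not. Candidates are scored by the difference of log-probabilities, after a plausibility cutoff controlled by an assumed parameter `alpha`. All names, probabilities, and the cutoff value are illustrative.

```python
import math


def contrastive_pick(p_context, p_baseline, alpha=0.1):
    """Pick the candidate that the context-aware model prefers most
    relative to the baseline model.

    p_context, p_baseline: dicts mapping candidate item -> probability
    (hypothetical inputs, not taken from the paper).
    alpha: plausibility cutoff; candidates whose context probability is
    below alpha * max(p_context) are dropped before scoring.
    """
    cutoff = alpha * max(p_context.values())
    scores = {}
    for item, p in p_context.items():
        q = p_baseline.get(item, 0.0)
        # Keep only plausible candidates with usable probabilities.
        if p > 0.0 and p >= cutoff and q > 0.0:
            scores[item] = math.log(p) - math.log(q)
    if not scores:
        raise ValueError("no candidate passed the plausibility cutoff")
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative numbers only.
p_context = {"book_a": 0.50, "book_b": 0.30, "book_c": 0.19, "book_d": 0.01}
p_baseline = {"book_a": 0.60, "book_b": 0.15, "book_c": 0.20, "book_d": 0.05}

best, scores = contrastive_pick(p_context, p_baseline, alpha=0.1)
print(best, scores)
```

In this toy input, `book_a` is the most likely item under both distributions, yet the contrastive score favors `book_b`, because that is where the context-aware distribution gains the most over the baseline. The cutoff removes `book_d`, which would otherwise be scored despite being implausible under `p_context`.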

Citations: 0 (influential: 0); Altmetric: 30.5; Score: 152.5
