2601.07224v1 Jan 12, 2026 cs.AI

Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

Yang Zhao (Citations: 70; h-index: 4)
Yangou Ouyang (Citations: 7; h-index: 2)
Xiao Ding (Citations: 1,264; h-index: 17)
Hepeng Wang (Citations: 24; h-index: 3)
Bibo Cai (Citations: 94; h-index: 6)
Kai Xiong, Research Center for Social Computing and Information Retrieval (Citations: 406; h-index: 9)
Jin-Fang Gao (Citations: 170; h-index: 5)
Zhouhao Sun (Citations: 87; h-index: 5)
Bing Qin (Citations: 411; h-index: 12)
Ting Liu (Citations: 1,454; h-index: 15)
Li Du (Citations: 40; h-index: 4)

Abstract

While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22$\times$. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
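The routing rule described in the abstract can be illustrated with a small sketch. The abstract does not say how gradient concentration is measured, so the metric here (the share of squared gradient mass held by the largest 10% of coordinates), the threshold of 0.5, and the names `concentration` and `route` are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Sequence


def concentration(grad: Sequence[float], top_frac: float = 0.1) -> float:
    """Share of squared gradient mass held by the largest `top_frac` of coordinates.

    Assumed stand-in metric: values near 1.0 mean the update is concentrated
    in a few coordinates; values near `top_frac` mean it is spread out evenly.
    """
    squared = sorted((g * g for g in grad), reverse=True)
    total = sum(squared)
    if total == 0.0:
        return 0.0  # zero (or empty) gradient: treat as not concentrated
    k = max(1, int(len(squared) * top_frac))
    return sum(squared[:k]) / total


def route(grads: Dict[str, Sequence[float]], threshold: float = 0.5) -> Dict[str, List[str]]:
    """Send concentrated-gradient samples to RL and diffuse ones to SFT.

    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    buckets: Dict[str, List[str]] = {"rl": [], "sft": []}
    for sample_id, grad in grads.items():
        stage = "rl" if concentration(grad) >= threshold else "sft"
        buckets[stage].append(sample_id)
    return buckets


# Toy per-sample gradients: one dominated by a single coordinate, one uniform.
example = {
    "peaked": [10.0] + [0.1] * 9,
    "diffuse": [1.0] * 10,
}
print(route(example))
```

In practice the per-sample gradients would come from a backward pass through the model, and the paper's "spatial" analysis may operate over layers or parameter groups rather than a flat vector; the sketch only shows the shape of the decision.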

2 Citations

0 Influential

8.5 Altmetric

44.5 Score
