2602.01365v1 Feb 01, 2026 cs.LG

When Domains Interact: Asymmetric and Order-Sensitive Cross-Domain Effects in Reinforcement Learning for Reasoning

Chuan Ma

Citations: 5

h-index: 1

Wang Yang

Citations: 56

h-index: 3

Shouren Wang

Case Western Reserve University

Citations: 2

h-index: 1

Chaoda Song

Citations: 8

h-index: 2

Xinpeng Li

Citations: 53

h-index: 3

Nengbo Wang

Citations: 13

h-index: 1

Kaixiong Zhou

Citations: 2,102

h-index: 22

Vipin Chaudhary

Citations: 116

h-index: 5

Xiaotian Han

Citations: 95

h-index: 5

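Group Relative Policy Optimization (GRPO) has become a core technique for improving the reasoning abilities of large language models, but how it behaves under different domain-ordering strategies is not well understood. In particular, the effects of sequential training (one domain at a time) versus mixed-domain training (multiple domains at a time) in GRPO have not been studied systematically. This work presents the first systematic analysis of training-order effects across math, science, logic, and puzzle reasoning tasks. The results show that (1) single-domain generalization is highly asymmetric: training on other domains improves math reasoning accuracy by about 25%, while there is almost no transfer to logic and puzzle. (2) Cross-domain interactions depend strongly on order: training in the order math→science reaches 83%/41% accuracy on math/science, whereas reversing the order to science→math lowers performance to 77%/25%. (3) No single strategy is universally optimal in multi-domain training: sequential training favors math (up to 84%), mixed training favors science and logic, and a poor ordering can cause a substantial performance gap (from 70% to 56%). Overall, the results show that GRPO in multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, underscoring the need for domain-aware and order-aware training design.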

Original Abstract

Group Relative Policy Optimization (GRPO) has become a key technique for improving reasoning abilities in large language models, yet its behavior under different domain sequencing strategies is poorly understood. In particular, the impact of sequential (one domain at a time) versus mixed-domain (multiple domains at a time) training in GRPO has not been systematically studied. We provide the first systematic analysis of training-order effects across math, science, logic, and puzzle reasoning tasks. We find that (1) single-domain generalization is highly asymmetric: training on other domains improves math reasoning by approximately 25% accuracy, while yielding negligible transfer to logic and puzzle; (2) cross-domain interactions are highly order-dependent: training in the order math→science achieves 83% / 41% accuracy on math / science, while reversing the order to science→math degrades performance to 77% / 25%; (3) no single strategy is universally optimal in multi-domain training: sequential training favors math (up to 84%), mixed training favors science and logic, and poor ordering can incur large performance gaps (from 70% to 56%). Overall, our findings demonstrate that GRPO under multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, highlighting the necessity of domain-aware and order-aware training design.
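
The difference between the two strategies compared in the abstract can be illustrated with a minimal Python sketch. It only builds the per-step domain composition of each batch and does not run GRPO. The function names, batch size, and step counts are hypothetical choices for illustration, and the even split of each mixed batch across domains is an assumption, not the paper's actual setup.

```python
# Illustrative sketch only: names, batch size, step counts, and the even
# mixed-batch split are assumptions, not taken from the paper.
DOMAINS = ["math", "science", "logic", "puzzle"]


def sequential_batches(order, steps_per_domain, batch_size):
    """Sequential training: every batch is drawn from a single domain,
    and all steps for one domain finish before the next domain starts."""
    return [[domain] * batch_size
            for domain in order
            for _ in range(steps_per_domain)]


def mixed_batches(domains, total_steps, batch_size):
    """Mixed-domain training: every batch contains examples from all
    domains, filled round-robin so the shares are as even as possible."""
    one_batch = [domains[i % len(domains)] for i in range(batch_size)]
    return [list(one_batch) for _ in range(total_steps)]


# Two sequential orderings over the same data budget, plus a mixed schedule.
math_then_science = sequential_batches(["math", "science"], 2, 4)
science_then_math = sequential_batches(["science", "math"], 2, 4)
mixed = mixed_batches(DOMAINS, 4, 8)
```

Under these assumptions, `math_then_science` and `science_then_math` contain the same batches in a different order, which is the only variable that changes in the order-sensitivity comparison, while `mixed` exposes the model to every domain at every step.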
