2602.12566v1 Feb 13, 2026 cs.AI

To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Haoqing Wang (Citations: 30, h-index: 3)
Ziheng Li (Citations: 60, h-index: 5)
Yilong Xu (Citations: 107, h-index: 5)
Yehui Tang (Citations: 18, h-index: 3)
Xiang Long (Citations: 4,030, h-index: 5)
Tingguang Li (Citations: 87, h-index: 4)

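Reinforcement Learning with Verifiable Rewards (RLVR) is central to eliciting explicit reasoning in Large Language Models (LLMs). Within a single domain such as coding or math, RLVR can reach expert-level performance. When a general model with expert-level performance across multiple domains is required, we must carefully consider how RLVR on different domains interacts. Current state-of-the-art models mainly follow one of two training paradigms for multi-domain RLVR: mixed multi-task RLVR, or separate per-domain RLVR followed by model merging. Most prior work, however, offers no detailed comparison or analysis of these paradigms. To fill this gap, we select several widely used high-level tasks (math, coding, science, and instruction following) as target domains and design extensive qualitative and quantitative experiments on open-source datasets. We find that RLVR across domains exhibits little mutual interference and that reasoning-intensive domains reinforce one another. We further analyze the internal mechanisms behind these mutual gains from the perspectives of weight-space geometry, model prediction behavior, and information constraints. The project is named M2RL, for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and its homepage is https://github.com/mosAI25/M2RL.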

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert-level performance in some specific domains via RLVR, such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most of the works did not provide a detailed comparison and analysis about these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find the RLVR across domains exhibits few mutual interferences, and reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. This project is named as M2RL that means Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at https://github.com/mosAI25/M2RL
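
The second paradigm named in the abstract, separate RLVR followed by model merging, can be illustrated with a small sketch. The abstract does not say which merging method the authors use, so the code below assumes one common choice: averaging each expert's weight difference from a shared base model and adding it back. The names `task_vector`, `merge_experts`, `scale`, and the toy weights are hypothetical and are not taken from the M2RL repository.

```python
from typing import Dict, List

# Assumption: a model is represented as parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(expert: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a domain expert and the shared base."""
    return {
        name: [e - b for e, b in zip(expert[name], base[name])]
        for name in base
    }


def merge_experts(base: Weights, experts: List[Weights], scale: float = 1.0) -> Weights:
    """Add the averaged expert task vectors back onto the base weights.

    This is one generic merging rule, assumed here for illustration only.
    """
    if not experts:
        raise ValueError("need at least one expert")
    vectors = [task_vector(expert, base) for expert in experts]
    merged: Weights = {}
    for name, base_values in base.items():
        merged[name] = [
            b + scale * sum(v[name][i] for v in vectors) / len(vectors)
            for i, b in enumerate(base_values)
        ]
    return merged


# Toy weights standing in for a base model and two separately trained experts.
base = {"layer.weight": [1.0, 2.0, 3.0]}
math_expert = {"layer.weight": [1.5, 2.0, 3.0]}
code_expert = {"layer.weight": [1.0, 2.0, 2.0]}

merged = merge_experts(base, [math_expert, code_expert])
print(merged)
```

With `scale=1.0` this reduces to a plain average of the expert weights. The mixed multi-task paradigm, by contrast, would train a single model on prompts sampled from all domains and needs no merging step.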

3 Citations

0 Influential

22.5 Altmetric

115.5 Score

