2602.09173v1 Feb 09, 2026 cs.LG

n-Musketeers: Collaboration Among Language Models Through Reinforcement Learning

$n$-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models

Mahdi Imani (Citations: 77, h-index: 6)

Ryozo Masukawa (Citations: 66, h-index: 5)

Sanggeon Yun (Citations: 210, h-index: 8)

Hyunwoo Oh (Citations: 8, h-index: 2)

Hanning Chen (Citations: 276, h-index: 10)

Wenjun Huang (Citations: 130, h-index: 6)

Nathaniel D. Bastian (Citations: 21, h-index: 3)

Mohsen Imani (Citations: 7, h-index: 2)

S. Jeong (Citations: 0, h-index: 0)

Raheeb Hassan (Citations: 2, h-index: 1)

P. Mercati (Citations: 449, h-index: 13)

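Recent advances in reinforcement learning with verifiable rewards (RLVR) show that small, specialized language models (SLMs) can exhibit structured reasoning without relying on large monolithic LLMs. This work introduces soft hidden-state collaboration, a method that integrates multiple heterogeneous, frozen SLM experts by sharing their internal representations through a trainable attention interface. Experiments on Reasoning Gym and GSM8K show that this latent integration is competitive with strong single-model RLVR baselines. Further analysis reveals two mechanisms of expert utilization: in relatively simple arithmetic domains, performance gains are explained mainly by static expert preferences, whereas in more challenging settings attention over the experts becomes increasingly concentrated and structured during training, indicating gradual specialization in how the router connects to relevant experts. Overall, hidden-state collaboration provides a compact mechanism for leveraging frozen experts and offers a window into expert utilization patterns and how they evolve under RLVR.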

Original Abstract

Recent progress in reinforcement learning with verifiable rewards (RLVR) shows that small, specialized language models (SLMs) can exhibit structured reasoning without relying on large monolithic LLMs. We introduce soft hidden-state collaboration, where multiple heterogeneous frozen SLM experts are integrated through their internal representations via a trainable attention interface. Experiments on Reasoning Gym and GSM8K show that this latent integration is competitive with strong single-model RLVR baselines. Ablations further reveal a dual mechanism of expert utilization: for simpler arithmetic domains, performance gains can largely be explained by static expert preferences, whereas more challenging settings induce increasingly concentrated and structured expert attention over training, indicating emergent specialization in how the router connects to relevant experts. Overall, hidden-state collaboration provides a compact mechanism for leveraging frozen experts, while offering an observational window into expert utilization patterns and their evolution under RLVR.
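The abstract does not spell out the attention interface, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each frozen expert contributes one hidden vector already projected to a shared dimension, that a hypothetical router supplies a `query` vector, and that fusion is plain scaled dot-product attention. The names `fuse_expert_states` and `attention_entropy` are illustrative. The entropy helper shows one possible way to quantify how "concentrated" the expert attention is: lower entropy means the weights focus on fewer experts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_expert_states(query, expert_states):
    """Attend over one hidden vector per frozen expert.

    Assumption: all vectors share the query's dimension (in practice a
    per-expert projection would be needed for heterogeneous models).
    Returns (attention weights, weighted sum of expert vectors).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, h) / scale for h in expert_states])
    dim = len(expert_states[0])
    fused = [sum(w * h[i] for w, h in zip(weights, expert_states))
             for i in range(dim)]
    return weights, fused


def attention_entropy(weights):
    """Shannon entropy (nats) of the expert attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Toy example: three hypothetical experts, query aligned with the first one.
query = [1.0, 0.0, 0.0, 0.0]
expert_states = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]
weights, fused = fuse_expert_states(query, expert_states)
uniform_entropy = attention_entropy([1.0 / 3.0] * 3)
```

In this toy setup the first expert receives the largest weight, and the entropy of `weights` falls below that of a uniform distribution over the three experts. Tracking such an entropy over training steps is one way the "increasingly concentrated" attention described in the abstract could be observed, though the paper may use a different measure.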

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
