2605.07244v1 May 08, 2026 cs.LG

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

Dhananjay Ram (Citations: 121, h-index: 4)

Wei Xia (Citations: 82, h-index: 5)

S. Soatto (Citations: 1,675, h-index: 19)

Yuting Zhang (Citations: 8, h-index: 2)

Zhaoyang Zhang (Citations: 41, h-index: 3)

Xiaoze Liu (Citations: 2,445, h-index: 20)

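This paper introduces Mutual Reinforcement Learning, a framework in which heterogeneous LLM policies undergo concurrent RL post-training and exchange typed experience while each keeps its own parameters, objectives, and tokenizer. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes it practical to study experience-sharing designs across model families. On top of GRPO, the authors implement three controlled probes: data-level rollout sharing (Peer Rollout Pooling, PRP), value-level advantage sharing (Cross-Policy GRPO Advantage Sharing, XGRPO), and outcome-level success transfer (Success-Gated Transfer, SGT). A contextual-bandit analysis characterizes where each probe sits on a stability-support trade-off: PRP incurs density-ratio variance and residual THL costs, XGRPO keeps on-policy actor support while changing the scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the most favorable point on this trade-off.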
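To make the idea of aligning token-level traces across incompatible vocabularies concrete, here is a minimal sketch under stated assumptions. It is not the paper's THL. The two toy tokenizers (`spans_whitespace`, `spans_fixed`), the helper `transfer_trace`, and the choice to split each source token's value in proportion to character overlap are all hypothetical and introduced only for illustration. A real system would presumably re-encode the text with the receiving model's tokenizer.

```python
import re

def spans_whitespace(text):
    """Toy tokenizer A (assumption): optional leading spaces plus a run of non-space chars."""
    return [(m.start(), m.end()) for m in re.finditer(r"\s*\S+", text)]

def spans_fixed(text, width=3):
    """Toy tokenizer B (assumption): fixed-width character chunks."""
    return [(i, min(i + width, len(text))) for i in range(0, len(text), width)]

def transfer_trace(src_spans, src_values, dst_spans):
    """Move per-token values onto another tokenization via character overlap.

    Each source value is split across destination tokens in proportion to the
    number of characters they share. This is a heuristic, not the paper's rule.
    """
    out = []
    for d0, d1 in dst_spans:
        total = 0.0
        for (s0, s1), value in zip(src_spans, src_values):
            overlap = max(0, min(d1, s1) - max(d0, s0))
            if overlap:
                total += value * overlap / (s1 - s0)
        out.append(total)
    return out

text = "2 + 2 = 4"
src_spans = spans_whitespace(text)           # 5 tokens under tokenizer A
src_logps = [-0.5, -1.0, -0.2, -0.4, -0.1]   # made-up per-token log-probs
dst_spans = spans_fixed(text)                # 3 tokens under tokenizer B
dst_logps = transfer_trace(src_spans, src_logps, dst_spans)
```

Because both tokenizations cover the same characters, the proportional split keeps the sequence-level total unchanged, while individual token values are only approximate. That approximation is one plausible source of the kind of residual cost the abstract attributes to THL, though the paper's own definition may differ.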

Original Abstract

We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.
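As a rough illustration of value-level sharing, the sketch below computes group-normalized advantages in the style of GRPO and then shows one hypothetical way a peer's rewards could shift the scalar baseline while advantages are still assigned only to the policy's own rollouts. The function name `group_advantages`, the `baseline_rewards` parameter, and the pooled mean/std baseline are assumptions made for this example. The abstract does not give the XGRPO formula, so this should not be read as the paper's method.

```python
from statistics import mean, pstdev

def group_advantages(rewards, baseline_rewards=None, eps=1e-6):
    """Group-normalized advantages for one policy's rollouts on a single prompt.

    rewards: scalar rewards of the policy's own (on-policy) rollouts.
    baseline_rewards: rewards used for the mean/std baseline. Defaults to
        `rewards`, which gives the usual group-relative form. Passing a pooled
        list is a HYPOTHETICAL stand-in for cross-policy baseline sharing.
    """
    ref = rewards if baseline_rewards is None else baseline_rewards
    mu = mean(ref)
    sigma = pstdev(ref)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

own_rewards = [1.0, 0.0, 0.0, 1.0]    # made-up verifier outcomes, policy A
peer_rewards = [1.0, 1.0, 1.0, 0.0]   # made-up verifier outcomes, policy B

plain = group_advantages(own_rewards)
pooled = group_advantages(own_rewards, baseline_rewards=own_rewards + peer_rewards)
```

In this toy case the stronger peer raises the baseline, so policy A's successes earn a smaller positive advantage and its failures a larger negative one. The gradient is still taken only over A's own samples, which is the sense in which on-policy actor support is preserved here.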
