2604.18530v1 Apr 20, 2026 cs.AI

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

Yujia Liu (Citations: 0, h-index: 0)
Xuebo Liu (Citations: 477, h-index: 11)
Min Zhang (Citations: 24, h-index: 3)
Changhong Jin (Citations: 35, h-index: 4)
Qiang Wang (Citations: 7, h-index: 1)
Derek F. Wong (Citations: 17, h-index: 1)
Mingzhou Xu (Citations: 130, h-index: 6)

Abstract

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.
