2602.14169v1 Feb 15, 2026 cs.LG

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

Yiran Guo (Citations: 55, h-index: 2)
Zhongjian Qiao (Citations: 84, h-index: 5)
Yingqi Xie (Citations: 2, h-index: 1)
Jie Liu (Citations: 305, h-index: 4)
Dan Ye (Citations: 265, h-index: 3)
Shuang Qiu (Citations: 1, h-index: 1)
Lijie Xu (Citations: 49, h-index: 1)
Ruiqing Zhang (Citations: 250, h-index: 7)

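Effective exploration is a key challenge in reinforcement learning for large language models (LLMs). Here, exploration means discovering high-quality trajectories in the vast space of natural language sequences within a limited sampling budget. Existing methods have the following limitations. GRPO samples only from the root node, so it over-explores high-probability trajectories while deep, error-prone states remain under-explored. Tree-based methods blindly spread the budget across trivial or unrecoverable states, which dilutes sampling, fails to uncover rare correct suffixes, and destabilizes local baselines. To address these problems, we propose Deep Dense Exploration (DDE), an exploration strategy that concentrates on 'pivots': deep, recoverable states within failed trajectories. We instantiate DDE as DEEP-GRPO, which introduces three key innovations: (1) a lightweight, data-driven utility function that automatically balances recoverability against depth bias to identify pivot states; (2) local dense resampling at each pivot to raise the probability of discovering a correct continuation; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. In experiments on mathematical reasoning benchmarks, the proposed method consistently outperforms GRPO, tree-based methods, and other strong baselines.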

Original Abstract

Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $\textit{pivots}$: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
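
The sampling-dilution argument in the abstract can be illustrated with a small calculation. The sketch below is not the paper's utility function or the DEEP-GRPO algorithm; it only compares the chance of finding at least one correct suffix when a fixed budget is concentrated on one recoverable state versus split evenly across several states. The function names, the per-sample success probabilities, and the independence assumption are all hypothetical choices made for illustration.

```python
from math import prod


def discovery_probability(p_correct: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** n_samples


def dense_discovery(p_per_state: list[float], budget: int, pivot: int) -> float:
    """Spend the whole budget resampling from a single chosen state."""
    return discovery_probability(p_per_state[pivot], budget)


def dispersed_discovery(p_per_state: list[float], budget: int) -> float:
    """Split the budget evenly across all states (remainder is unused)."""
    per_state = budget // len(p_per_state)
    # Probability that every sample at every state misses.
    miss_all = prod((1.0 - p) ** per_state for p in p_per_state)
    return 1.0 - miss_all


if __name__ == "__main__":
    # Hypothetical per-sample success rates: three unrecoverable states
    # (rate 0.0) and one recoverable state with a rare correct suffix.
    rates = [0.0, 0.0, 0.02, 0.0]
    budget = 32
    print("dense at pivot:", dense_discovery(rates, budget, pivot=2))
    print("evenly dispersed:", dispersed_discovery(rates, budget))
```

Under these assumptions, every sample spent on an unrecoverable state contributes nothing, so concentrating the budget on the recoverable state gives a strictly higher discovery probability than the even split. How pivots are actually identified is the job of the utility function described in the abstract, which this sketch does not model.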

