2601.05870v1 Jan 09, 2026 cs.LG

IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck

Long Li (Citations: 38, h-index: 3)
Ming Li (Citations: 20, h-index: 3)
Huilin Deng (Citations: 80, h-index: 3)
Hongcheng Luo (Citations: 779, h-index: 16)
Yue Zhu (Citations: 19, h-index: 3)
Zhuoyue Chen (Citations: 2, h-index: 1)
Xinghao Zhao (Citations: 0, h-index: 0)
Jihai Zhang (Citations: 54, h-index: 4)
Mengchang Wang (Citations: 0, h-index: 0)
Yang Cao (Citations: 0, h-index: 0)
Yu Kang (Citations: 61, h-index: 2)

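Reinforcement Learning with Verifiable Rewards (RLVR), used to improve the reasoning ability of Large Language Models (LLMs), suffers from a persistent problem called exploration collapse, in which the model becomes trapped in narrow, over-optimized behaviors. The cause is the semantic homogeneity of randomly sampled rollouts. Existing methods use policy entropy to encourage exploration, but each has an inherent limitation: global entropy regularization is vulnerable to reward hacking and can induce meaningless verbosity, while local token-selective updates struggle against the strong inductive bias of pre-trained models. To address this, the authors propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO). Rather than statistically perturbing token distributions, IIB-LPO drives exploration through topological branching of reasoning trajectories. It triggers latent branching at high-entropy states to diversify reasoning paths, and it uses the Information Bottleneck principle both as a trajectory filter and as a self-reward mechanism so that exploration stays concise and informative. Experiments on four mathematical reasoning benchmarks show that IIB-LPO achieves state-of-the-art performance, improving on prior methods by up to 5.3% in accuracy and up to 7.4% in diversity metrics.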

Original Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
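The abstract says that branching is triggered at high-entropy states but does not give the criterion. As a rough illustration only, the sketch below assumes the trigger is a fixed threshold on the Shannon entropy of each next-token distribution. The function names, the threshold value, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def branch_points(step_probs, threshold):
    """Return the decoding steps whose entropy exceeds the threshold.

    Assumption: a fixed entropy threshold marks a "high-entropy state".
    The actual trigger used by IIB-LPO is not described in the abstract.
    """
    return [t for t, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold]

# Toy rollout: one next-token distribution per decoding step (made-up numbers).
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform step, entropy = ln(4)
    [0.70, 0.10, 0.10, 0.10],  # moderately confident step
    [0.40, 0.30, 0.20, 0.10],  # uncertain step
]

points = branch_points(step_probs, threshold=1.0)
print(points)  # -> [1, 3]
```

In this toy rollout only the uniform step and the last step exceed the threshold, so a branching scheme of this kind would fork alternative reasoning paths at those two positions and leave the confident steps untouched.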

0 Citations

0 Influential

8 Altmetric

40.0 Score
