2602.10609v1 Feb 11, 2026 cs.CL

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Lang Feng (citations: 729, h-index: 11)

Shuo He (citations: 226, h-index: 8)

Xin Cheng (citations: 33, h-index: 3)

Lei Feng (citations: 60, h-index: 5)

Bo An (citations: 18, h-index: 1)

Abstract

Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically either use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address this issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively, based only on the states of past tokens and independent of future tokens. The resulting filtered IS ratios preserve token-wise, local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
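To make the idea of causally filtering per-token IS ratios concrete, the following is a minimal sketch of a scalar Kalman filter run over a sequence of log IS ratios. It is not the authors' implementation: the abstract does not specify the state model, so the random-walk dynamics, the choice to filter in log space, the noise parameters `q` and `r`, the prior `x0`/`p0`, the function name `causal_kalman_filter`, and the example log-probabilities are all assumptions made for illustration.

```python
import math


def causal_kalman_filter(log_ratios, q=1e-2, r=1.0, x0=0.0, p0=1.0):
    """Causally filter noisy per-token log IS ratios.

    Assumed model (hypothetical, not taken from the paper):
      state:       x_t = x_{t-1} + w_t,  w_t ~ N(0, q)   (random walk)
      observation: z_t = x_t + v_t,      v_t ~ N(0, r)
    The estimate at token t depends only on tokens 1..t.
    """
    filtered = []
    x, p = x0, p0  # x0 = 0.0 corresponds to a prior IS ratio of 1 (on-policy)
    for z in log_ratios:
        # Predict step: the random walk keeps the mean and inflates the variance.
        p_pred = p + q
        # Update step: blend the prediction with the new observation.
        gain = p_pred / (p_pred + r)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p_pred
        filtered.append(x)
    return filtered


# Hypothetical per-token log-probabilities under the new and old policies.
logp_new = [-1.0, -0.9, -3.5, -1.1, -1.0]
logp_old = [-1.0, -1.0, -1.0, -1.0, -1.0]

raw_log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
filtered_log_ratios = causal_kalman_filter(raw_log_ratios)

# Map back from log space to IS ratios.
raw_ratios = [math.exp(v) for v in raw_log_ratios]
filtered_ratios = [math.exp(v) for v in filtered_log_ratios]
```

In this sketch, a larger `r` relative to `q` smooths more aggressively, while a larger `q` lets the filtered ratio track token-level variation more closely. Because each estimate is computed from past and current tokens only, changing a later token never alters the filtered ratio of an earlier one.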

Citations: 1

Influential citations: 0

Altmetric: 5.5

Score: 28.5
