2604.21327v1 Apr 23, 2026 cs.LG

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Yongcan Yu (Citations: 40, h-index: 3)
Qianlong Xie (Citations: 44, h-index: 4)
Xingxing Wang (Citations: 23, h-index: 3)
Jian Liang (Citations: 212, h-index: 7)
Ran He (Citations: 53, h-index: 3)
Lingxiao He (Citations: 25, h-index: 2)
Kuangpu Guo (Citations: 24, h-index: 2)
Meng Wang (Citations: 57, h-index: 4)

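Test-time reinforcement learning (TTRL) adapts a model at inference time using pseudo-labels, which leaves it vulnerable to spurious optimization signals caused by label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and are the main source of reward noise. More importantly, we find that these spurious signals can be further amplified by group-relative advantage estimation. Building on these findings, we propose Debiased and Denoised test-time Reinforcement Learning (DDRL), a unified framework for mitigating spurious signals. Specifically, DDRL first applies a frequency-based sampling strategy that excludes ambiguous samples while keeping positive and negative examples balanced. It then adopts debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL includes a consensus-based off-policy refinement stage that uses the rejection-sampled dataset to enable efficient and stable model updates. In experiments with three large language models on several mathematical reasoning benchmarks, DDRL consistently outperforms existing TTRL baselines. The code and related materials will soon be released at https://github.com/yuyongcan/DDRL.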

Original Abstract

Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
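
The amplification effect described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, which has not been released at the time of this listing. It assumes majority-vote pseudo-labels with binary rewards, a GRPO-style advantage computed as (reward - group mean) / group standard deviation using the population standard deviation, and a hypothetical fixed-advantage scheme of +1 and -1. The function names, the `eps` term, and the fixed values are all illustrative assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Pseudo-label = most frequent answer; reward 1.0 if a response matches it."""
    label, votes = Counter(answers).most_common(1)[0]
    consistency = votes / len(answers)  # share of responses agreeing with the label
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, consistency, rewards


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one sampled group (population std assumed)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fixed_advantages(rewards, pos=1.0, neg=-1.0):
    """Hypothetical fixed scheme: magnitude does not depend on the group mix."""
    return [pos if r > 0 else neg for r in rewards]


if __name__ == "__main__":
    groups = {
        "medium consistency (3/8)": ["12", "12", "12", "15", "15", "7", "9", "20"],
        "high consistency (7/8)": ["12"] * 7 + ["15"],
    }
    for name, answers in groups.items():
        label, consistency, rewards = majority_vote_rewards(answers)
        relative = group_relative_advantages(rewards)
        fixed = fixed_advantages(rewards)
        print(name, "label:", label, "consistency:", consistency)
        print("  group-relative advantage of a matching response:", round(relative[0], 3))
        print("  fixed advantage of a matching response:", fixed[0])
```

Under these assumptions, a response matching the pseudo-label receives a group-relative advantage of roughly 1.29 when only 3 of 8 responses agree, but roughly 0.38 when 7 of 8 agree. The least reliable pseudo-labels therefore yield the largest positive updates, which is one way a spurious signal could be amplified; a fixed advantage keeps the magnitude constant across both groups.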

Citations: 1 | Influential: 0 | Altmetric: 26.9657359028 | Score: 135.8
