2603.13134v1 Mar 13, 2026 cs.AI

When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

Tian Lan (Citations: 13, h-index: 2)

Yu Li (Citations: 8, h-index: 2)

Zhengling Qi (Citations: 11, h-index: 1)

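Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. GRPO computes advantages relative to the group mean, but during optimization it treats each output as an independent sample and overlooks an important structural signal: the contrast between correct and incorrect solutions within the same group. It therefore ignores rich comparative data that could be exploited by explicitly contrasting successful reasoning traces with failed ones. To use this signal, we present a contrastive reformulation of GRPO and show that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that lets the model cross-reference successful and failed reasoning traces during optimization, enabling a direct flow of information across samples. We also introduce Reward-Confidence Correction (RCC), which stabilizes training by dynamically adjusting the GRPO advantage baseline. RCC uses the reward-confidence covariance obtained from a first-order approximation of the variance-minimizing estimator. Neither mechanism requires additional sampling or auxiliary models, and both can be applied to all GRPO variants. Experiments on mathematical reasoning benchmarks show consistent improvements across a range of models and algorithms. Code is available at [https://github.com/Skylanding/BiCC](https://github.com/Skylanding/BiCC)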

Original Abstract

Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on the group mean, GRPO treats each output as an independent sample during optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group. It thus ignores the rich comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC) to stabilize training by dynamically adjusting the advantage baseline in GRPO using the reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across a range of models and algorithms. Code is available at [https://github.com/Skylanding/BiCC](https://github.com/Skylanding/BiCC).
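
The claim that the GRPO objective implicitly maximizes a margin between the policy ratios of correct and incorrect samples can be checked numerically in a simplified setting. The sketch below is not taken from the paper or its repository. It assumes binary rewards, sequence-level policy ratios, population-standard-deviation normalization, and no clipping or KL term. Under those assumptions the group surrogate `(1/G) * sum(A_i * rho_i)` equals `sqrt(p * (1 - p)) * (mean ratio of correct - mean ratio of incorrect)`, where `p` is the fraction of correct samples. The function names and the example numbers are hypothetical.

```python
import math


def grpo_advantages(rewards):
    """Group-normalized advantages: (r - group mean) / population std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # An all-correct or all-wrong group carries no contrastive signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def unclipped_surrogate(rewards, ratios):
    """Simplified GRPO surrogate: mean of advantage * policy ratio.

    Assumption: clipping and the KL penalty are ignored for clarity.
    """
    adv = grpo_advantages(rewards)
    return sum(a * rho for a, rho in zip(adv, ratios)) / len(rewards)


def contrastive_margin(rewards, ratios):
    """Scaled gap between mean ratios of correct and incorrect samples."""
    pos = [rho for r, rho in zip(rewards, ratios) if r == 1]
    neg = [rho for r, rho in zip(rewards, ratios) if r == 0]
    if not pos or not neg:
        return 0.0
    p = len(pos) / len(rewards)  # fraction of correct samples in the group
    gap = sum(pos) / len(pos) - sum(neg) / len(neg)
    return math.sqrt(p * (1.0 - p)) * gap


# Hypothetical group of six rollouts: 1 = correct, 0 = incorrect.
rewards = [1, 0, 0, 1, 0, 0]
ratios = [1.10, 0.95, 1.02, 1.05, 0.90, 0.98]  # pi_new / pi_old per rollout

print(unclipped_surrogate(rewards, ratios))
print(contrastive_margin(rewards, ratios))
```

The two printed values agree up to floating-point error, which illustrates why raising the ratios of correct samples relative to incorrect ones is what this simplified objective rewards. Groups in which every sample is correct, or every sample is wrong, contribute zero in both forms.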

4 Citations

0 Influential

26.493061443341 Altmetric

136.5 Score
