2602.09331v1 Feb 10, 2026 cs.CL

Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization

M. Khandoga

Citations: 5,062

h-index: 35

Rui Yuan

Citations: 231

h-index: 6

Vinay Kumar Sankarapu

Citations: 59

h-index: 5

Original Abstract

Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens: the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation; instead, importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming that we capture genuine causal structure rather than noise. Analysis shows that the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
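The three steps named in the abstract (mask a span, measure the drop in answer probability, upweight the span's tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact weighting formula, so the clipping at zero, the `1 + alpha * normalized importance` form, the rescaling to mean 1, and the REINFORCE-style surrogate loss are all assumptions made here for illustration. The names `span_importance`, `token_weights`, `weighted_pg_loss`, and `alpha` are hypothetical, and `answer_logprob` stands in for a call to the policy model that scores the final answer with one span masked.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) token indices of one reasoning span


def span_importance(
    answer_logprob: Callable[[Optional[Span]], float],
    spans: Sequence[Span],
) -> List[float]:
    """Drop in answer probability when each span is masked.

    answer_logprob(None) scores the unmasked reasoning; answer_logprob(span)
    scores it with that span masked. Negative drops are clipped to 0
    (an assumption of this sketch).
    """
    p_full = math.exp(answer_logprob(None))
    return [max(0.0, p_full - math.exp(answer_logprob(s))) for s in spans]


def token_weights(
    spans: Sequence[Span],
    importance: Sequence[float],
    num_tokens: int,
    alpha: float = 1.0,
) -> List[float]:
    """Per-token weights, rescaled so they average to 1.

    Each token in a span gets 1 + alpha * (span importance / total importance).
    Tokens outside every span, or all tokens when total importance is 0,
    keep the uniform weight.
    """
    total = sum(importance)
    raw = [1.0] * num_tokens
    if total > 0:
        for (start, end), imp in zip(spans, importance):
            for t in range(start, end):
                raw[t] = 1.0 + alpha * imp / total
    mean = sum(raw) / num_tokens
    return [w / mean for w in raw]


def weighted_pg_loss(
    token_logprobs: Sequence[float],
    weights: Sequence[float],
    advantage: float,
) -> float:
    """Token-weighted surrogate loss: -A * mean_t(w_t * log pi_t)."""
    n = len(token_logprobs)
    return -advantage * sum(w * lp for w, lp in zip(weights, token_logprobs)) / n


# Toy usage with made-up numbers: masking the filler span barely moves the
# answer probability, while masking the calculation span collapses it.
spans = [(0, 3), (3, 8)]  # e.g. "Let me think" / "23 + 45 = 68"
toy_scores = {None: math.log(0.9), (0, 3): math.log(0.88), (3, 8): math.log(0.2)}
importance = span_importance(lambda s: toy_scores[s], spans)
weights = token_weights(spans, importance, num_tokens=8)
```

With uniform weights the loss reduces to the usual sequence-averaged policy gradient surrogate, so the weighting acts only as a redistribution of credit across tokens. Keeping the mean weight at 1 is one way to avoid changing the effective learning rate; whether the paper normalizes this way is not stated in the abstract.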

6 Citations

0 Influential

17.5 Altmetric

93.5 Score
