2604.02795v1 Apr 03, 2026 cs.CL

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

Tianze Xu (Citations: 57, h-index: 4)
Lyumanshan Ye (Citations: 308, h-index: 7)
Pengrui Lu (Citations: 234, h-index: 4)
Yanzhao Zheng (Citations: 49, h-index: 3)
Zhentao Zhang (Citations: 44, h-index: 4)
Yuanqiang Yu (Citations: 24, h-index: 4)
Jihuai Zhu (Citations: 0, h-index: 0)
Chao Ma (Citations: 335, h-index: 5)
Baohua Dong (Citations: 33, h-index: 3)
Hangcheng Zhu (Citations: 10, h-index: 2)
Ruohui Huang (Citations: 31, h-index: 3)
Gang Yu (Citations: 4, h-index: 1)
Yongkang Wu (Citations: 13, h-index: 2)
Pengfei Liu (Citations: 203, h-index: 8)

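Rubric-based reinforcement learning (RL) has emerged as a promising way to align large language models (LLMs) with complex, open-domain instruction-following tasks. Existing methods, however, rely mainly on response-level rewards, which leads to severe reward sparsity and reward ambiguity. To address this, the paper proposes Rubrics to Tokens (RTT), a rubric-based RL framework that connects coarse response-level scores with fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator that predicts which tokens in a response are responsible for a specific constraint, and it optimizes the policy model with RTT-GRPO, which integrates response-level and token-level advantages in a single framework. Because token-level rubric-based RL moves from a one-dimensional, outcome-level reward to a three-dimensional reward space, the paper also proposes a new group normalization method, Intra-sample Token Group Normalization, to accommodate that shift. Across extensive experiments and benchmarks, RTT consistently outperforms the other baselines in both instruction-level and rubric-level accuracy on different models.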

Original Abstract

Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
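The abstract does not give the RTT-GRPO formulas, so the following is only a rough sketch of the general idea of mixing a response-level advantage with a token-level term. Everything in it is an assumption made for illustration: the function names `group_normalize` and `combine_advantages`, the mixing weight `beta`, the `[response][rubric][token]` layout of the relevance scores, and the choice to normalize each rubric item's scores across the sampled group. In particular, it does not implement the paper's Intra-sample Token Group Normalization, which the abstract names but does not define.

```python
from statistics import mean, pstdev


def group_normalize(scores, eps=1e-6):
    """GRPO-style normalization: center and scale scores within one group."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def combine_advantages(response_rewards, rubric_scores, relevance, beta=0.5):
    """Hypothetical mix of response-level and token-level advantages.

    response_rewards[i]  : scalar reward of sampled response i.
    rubric_scores[i][k]  : score of response i on rubric item k (e.g. 0 or 1).
    relevance[i][k][t]   : relevance in [0, 1] of token t in response i to
                           rubric item k, as a relevance discriminator
                           might output (assumed layout).
    beta                 : assumed weight of the token-level term.
    Returns one advantage per token for every response.
    """
    response_adv = group_normalize(response_rewards)
    num_rubrics = len(rubric_scores[0])

    # Assumption: normalize each rubric item's scores across the group.
    per_rubric_adv = [
        group_normalize([row[k] for row in rubric_scores])
        for k in range(num_rubrics)
    ]

    token_adv = []
    for i, resp_rel in enumerate(relevance):
        num_tokens = len(resp_rel[0])
        row = []
        for t in range(num_tokens):
            # Credit each token with the rubric items it is relevant to.
            token_term = sum(
                resp_rel[k][t] * per_rubric_adv[k][i]
                for k in range(num_rubrics)
            )
            row.append(response_adv[i] + beta * token_term)
        token_adv.append(row)
    return token_adv
```

In this sketch, a token that the relevance scores tie to a satisfied rubric item receives a larger advantage than the other tokens of the same response, while tokens with zero relevance fall back to the plain response-level advantage. Responses of different lengths are handled because the per-token loop reads each response's own token count.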
