2605.07276v1 May 08, 2026 cs.AI

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Ting Peng, Jia Li, Yuxin Su, Michael R. Lyu, Hailiang Huang, Yuetang Deng

Abstract

Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot $0.385$ to $0.535$. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from $0.48$ to $0.53$ and reduces average evaluation steps from $23.50$ to $17.02$. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
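To make the abstract's construction concrete, the following is a minimal sketch of how a layered outcome reward and a step-level process weight might interact with an unchanged group-normalized advantage. It is not the authors' implementation: the tier values (`0.0`, `0.5`, `1.0`), the multiplicative form of the step weighting, and all function names are illustrative assumptions, since the abstract does not specify them.

```python
from statistics import mean, pstdev


def layered_reward(compiles: bool, semantic_ok: bool) -> float:
    """Three-tier outcome reward: failure < compile-only < compile-and-semantic.

    The numeric tier values are hypothetical; only their ordering matters here.
    """
    if compiles and semantic_ok:
        return 1.0
    if compiles:
        return 0.5  # assumed value for the compile-only middle tier
    return 0.0


def binary_reward(compiles: bool, semantic_ok: bool) -> float:
    """Binary baseline: the compile-only middle tier collapses into failure."""
    return 1.0 if (compiles and semantic_ok) else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style advantage: normalize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_weighted_advantages(advantage: float, step_scores: list[float]) -> list[float]:
    """Scale one trajectory's advantage per step, outside group normalization.

    Multiplicative weighting is an assumed reading of "process-score weighting".
    """
    return [advantage * s for s in step_scores]


if __name__ == "__main__":
    # Four rollouts from the same prompt: (compiles, semantic_ok)
    group = [(False, False), (True, False), (True, True), (True, False)]

    layered = group_advantages([layered_reward(c, s) for c, s in group])
    binary = group_advantages([binary_reward(c, s) for c, s in group])
    print("layered advantages:", layered)
    print("binary advantages: ", binary)

    # Hypothetical per-step process scores for the successful rollout
    print("step-weighted:", step_weighted_advantages(layered[2], [0.2, 1.0, 0.6]))
```

In this sketch the binary reward gives a compile-only rollout exactly the same advantage as a non-compiling one, whereas the layered reward ranks it strictly higher. That is one way to read the abstract's remark that binary rewards remove the compile-only middle tier. The step weights are applied after normalization, so the group statistics themselves are untouched.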
