2604.18235v1 Apr 20, 2026 cs.CL

Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

Jiayi Wu

Min Gao

Can Xu

Ruobing Xie

Zeqian Huang

Lei Jiang

Kangyang Luo

Xiang Li

Abstract

Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
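
To make the "coarse-grained advantage assignment" point concrete, here is a minimal sketch of how a group-relative advantage is commonly computed in GRPO, followed by a purely illustrative step-level downscaling. The abstract does not give CalibAdv's formulas, so `downscale_negative`, the factor `alpha`, and the `step_correct` flags are hypothetical names and choices made for this example only; the use of population standard deviation is also an assumption. For the actual method, see the authors' repository.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: normalize each rollout's reward
    by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def downscale_negative(advantage, step_correct, alpha=0.5):
    """HYPOTHETICAL step-level calibration, not the paper's formula.

    Plain GRPO gives every step of a rollout the same advantage.
    Here, a negative advantage is shrunk by `alpha` on steps that
    were judged correct; all other steps keep the original value.
    """
    return [
        advantage * alpha if (advantage < 0 and ok) else advantage
        for ok in step_correct
    ]


# One group of four rollouts for the same question; only the first is right.
rewards = [1.0, 0.0, 0.0, 0.0]
adv = grpo_advantages(rewards)

# Second rollout: wrong final answer, but its first two search steps were fine.
step_adv = downscale_negative(adv[1], step_correct=[True, True, False])
```

In this toy group the three failed rollouts share one negative advantage, so without calibration their correct search steps are penalized exactly as much as the incorrect ones. The sketch covers only the downscaling idea; the rebalancing of positive and negative advantages in the answer component is not modeled here.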
