2601.22776v1 Jan 30, 2026 cs.AI

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

Xingzhe Wu (Citations: 1,327; h-index: 4)
Shichao Ma (Citations: 9; h-index: 1)
Zhiyuan Ma (Citations: 24; h-index: 2)
Xiaofan Li (Citations: 1; h-index: 1)
Yu Cheng (Citations: 4; h-index: 1)
Weiqiang Wang (Citations: 24; h-index: 3)
Zhen-Qiang Zhou (Citations: 172; h-index: 9)
Jintao Du (Citations: 23; h-index: 3)
Ming Yang (Citations: 44; h-index: 3)
Qiliang Liu (Citations: 101; h-index: 2)
Yang Wang (Citations: 5; h-index: 1)

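Multi-turn tool-integrated reasoning lets large language models (LLMs) solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely mainly on sparse outcome-level rewards, which leads to a "double homogenization dilemma." It takes two forms: (1) process homogenization, in which the thinking, reasoning, and tool use involved in generation are ignored; and (2) intra-group homogenization, in which coarse-grained outcome rewards make intra-group advantage estimation inefficient when methods such as GRPO are used during sampling. To address this, the paper proposes Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces a First-Occurrence Latent Reward (FOLR) mechanism that assigns a partial reward to the step at which the ground-truth answer first appears, preserving process-level signals and increasing within-group reward variance without an external reward model or additional annotations. Extensive experiments show that TSPO substantially outperforms state-of-the-art baselines, with average performance gains of 24% and 13.6% on the Qwen2.5-3B and 7B models, respectively.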

Original Abstract

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards often lead to inefficiencies in intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
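
The intra-group homogenization problem and the idea behind a first-occurrence reward can be illustrated with a small Python sketch. This is not the paper's implementation: the abstract does not give the FOLR formula, so the substring matching, the `partial=0.5` value, and the helper names `first_occurrence_turn` and `folr_turn_rewards` are all hypothetical. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def first_occurrence_turn(turn_texts, answer):
    """Index of the first turn whose text contains the answer, else None.

    Assumption: a case-insensitive substring match stands in for whatever
    matching rule the paper actually uses.
    """
    needle = answer.lower()
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            return i
    return None


def folr_turn_rewards(turn_texts, answer, partial=0.5):
    """Hypothetical per-turn rewards: `partial` at the first turn where the
    ground-truth answer appears, 0.0 for every other turn."""
    rewards = [0.0] * len(turn_texts)
    hit = first_occurrence_turn(turn_texts, answer)
    if hit is not None:
        rewards[hit] = partial
    return rewards


# Four rollouts for one question; every final answer is wrong, so an
# outcome-only reward gives each rollout 0.0.
answer = "Paris"
rollouts = [
    ["search: capital of France", "doc: Paris is the capital of France"],
    ["search: largest city in Germany", "doc: Berlin is the largest city"],
    ["doc: The Louvre is in Paris", "search: Louvre opening hours"],
    ["search: French cuisine", "doc: Baguettes are popular"],
]

outcome_only = [0.0, 0.0, 0.0, 0.0]
print(group_advantages(outcome_only))  # all zeros: no learning signal

# Adding the partial first-occurrence reward separates rollouts that
# surfaced the answer from those that did not.
with_partial = [sum(folr_turn_rewards(turns, answer)) for turns in rollouts]
print(with_partial)
print(group_advantages(with_partial))
```

With outcome-only rewards the group has zero variance, so every advantage is zero and the policy gradient vanishes for that group. Once rollouts that retrieved the answer receive a partial reward, the group has nonzero variance and the advantages become informative, which is the effect the abstract attributes to FOLR.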

Citations: 1

Influential citations: 0

Altmetric: 4.5

Score: 23.5
