2601.12465v1 Jan 18, 2026 cs.CL

Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

Nuo Chen (Citations: 469, h-index: 11)

Jia Li (Citations: 157, h-index: 8)

Miao Peng (Citations: 4, h-index: 1)

Weizhou Shen (Citations: 822, h-index: 9)

Chenliang Li (Citations: 39, h-index: 4)

Ming Yan (Citations: 130, h-index: 2)

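Reinforcement learning with verifiable rewards (RLVR) has been effective at improving the short-context reasoning of LLMs, but its performance degrades in long-context settings that require precise grounding and robust long-range reasoning. This work analyzes the "almost-there" phenomenon in long-context reasoning and traces it to two factors. First, there is a shortage of reasoning-dense data that pushes LLMs beyond simple grounding toward sophisticated multi-hop reasoning. Second, indiscriminately penalizing trajectories that are partially correct but end in a wrong final answer discards valuable learning signals during long-context reinforcement learning. To address these problems, the authors propose DeepReasonQA, a knowledge graph (KG)-based synthesis framework that controllably constructs high-difficulty, multi-hop long-context question-answer pairs with inherent reasoning chains. Building on it, they introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that evaluates reasoning steps at a fine granularity in terms of validity and relevance, extracting important learning signals from "almost-there" trajectories. In experiments on three long-context reasoning benchmarks, the proposed method substantially outperforms RLVR-based models and achieves performance comparable to frontier LLMs with far fewer parameters. Further analysis confirms that the method strengthens long-context reasoning while keeping reinforcement learning stable.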

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.
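
The abstract does not give the LongPAS formula, so the following is only an illustrative sketch of what step-level advantage shaping could look like, not the authors' implementation. It assumes a group-normalized outcome advantage per trajectory (a common RLVR baseline, assumed here) and hypothetical per-step `validity` and `relevance` scores in [0, 1]. The scaling rule `1 - validity * relevance` and all names (`Step`, `group_advantages`, `shape_step_advantages`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Step:
    text: str
    validity: float   # hypothetical score in [0, 1]: is the step sound?
    relevance: float  # hypothetical score in [0, 1]: does it bear on the question?


def group_advantages(rewards, eps=1e-6):
    """Group-normalized outcome advantages, one per sampled trajectory."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shape_step_advantages(steps, advantage):
    """Spread one trajectory-level advantage over its reasoning steps.

    Non-negative advantages are passed through unchanged. Negative
    advantages are scaled by (1 - validity * relevance), an assumed rule,
    so steps judged valid and relevant inside a failed trajectory are
    penalized less than steps judged invalid or irrelevant.
    """
    if advantage >= 0:
        return [advantage for _ in steps]
    return [advantage * (1.0 - s.validity * s.relevance) for s in steps]


if __name__ == "__main__":
    # Four sampled answers to one question: two correct, two incorrect.
    adv = group_advantages([1.0, 0.0, 0.0, 1.0])
    almost_there = [
        Step("locate the supporting passage", 1.0, 1.0),
        Step("chain two facts correctly", 0.9, 1.0),
        Step("draw the wrong final conclusion", 0.0, 1.0),
    ]
    print(shape_step_advantages(almost_there, adv[1]))
```

Under this assumed rule, an "almost-there" trajectory keeps its full penalty only on the steps scored as invalid or irrelevant, while correct trajectories are treated exactly as in outcome-only training.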
