2603.27977v1 Mar 30, 2026 cs.AI

SARL: Label-Free Reinforcement Learning Using Reasoning Topology

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

Bolian Li

Citations: 115

h-index: 4

Yifan Wang

Citations: 43

h-index: 2

David Cho

Citations: 6

h-index: 1

Ruqi Zhang

Citations: 715

h-index: 12

Fanping Sui

Citations: 248

h-index: 9

Ananth Grama

Citations: 1

h-index: 1

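Reinforcement learning plays an important role in improving large reasoning models, but it still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and hard to verify. Moreover, reasoning trajectories remain largely unconstrained, and optimizing for the final answer can favor early exploitation over generalization. This work asks whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and extends traditional reinforcement learning with verifiable rewards (RLVR) to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a new label-free framework. SARL builds a Reasoning Map for each response and, inspired by complex networks and the functional organization of the human brain, rewards reasoning trajectories that have a small-world topology. SARL encourages reasoning trajectories that are locally coherent and globally efficient, shifting the focus of supervision from outcome to path. In experiments on Qwen3-4B, SARL outperformed ground-truth-based reinforcement learning and prior label-free reinforcement learning methods. On math tasks, the best average gains were 9.1% under PPO and 11.6% under GRPO; on open-ended tasks, they were 34.6% under PPO and 30.4% under GRPO. SARL also showed lower KL divergence and higher policy entropy, indicating more stable and exploratory training and more generalized reasoning ability.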

Original Abstract

Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show that SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gains of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond good performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory training and generalized reasoning ability.
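The abstract does not give the reward formula, so the following is only an illustrative sketch of what "locally coherent and globally efficient" can mean for a graph. It assumes a Reasoning Map can be represented as an undirected graph whose nodes are reasoning steps, and it uses two standard complex-network measures: the average clustering coefficient (local coherence) and global efficiency (mean inverse shortest-path length). The function names and the product used in `structure_score` are hypothetical choices made for this example, not the paper's method.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs, skipping self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with < 2 neighbors count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of this node's neighbors.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def global_efficiency(adj):
    """Mean of 1/d(u, v) over ordered node pairs; unreachable pairs add 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        # Breadth-first search gives shortest hop counts from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def structure_score(edges):
    """Hypothetical scalar score: clustering times efficiency (an assumption)."""
    adj = build_graph(edges)
    return average_clustering(adj) * global_efficiency(adj)


if __name__ == "__main__":
    chain = [(0, 1), (1, 2), (2, 3)]              # purely linear reasoning
    linked = chain + [(0, 2), (1, 3)]             # same steps plus cross-links
    print(structure_score(chain), structure_score(linked))
```

Under these assumptions a purely linear chain of steps scores zero because no step's neighbors are connected to each other, while adding cross-links between steps raises both the clustering and the efficiency terms.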

1 Citations

0 Influential

6 Altmetric

31.0 Score
