2601.10349v1 Jan 15, 2026 cs.LG

SuS: Leveraging Surprise for Strategy-aware Intrinsic Exploration

SuS: Strategy-aware Surprise for Intrinsic Exploration

Citations: 0

h-index: 0

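In this paper, we propose Strategy-aware Surprise (SuS), a new intrinsic motivation framework for exploration in reinforcement learning. Unlike existing curiosity-driven methods, which rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures the consistency of the behavioral strategy across time steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation uses both signals through learned weighting coefficients. We evaluated SuS on mathematical reasoning tasks using large language models and observed substantial improvements in both accuracy and solution diversity. Ablation studies show that removing either component degrades performance by at least 10%, which supports the synergistic nature of our approach. Compared with baseline methods, SuS improves Pass@1 by 17.4% and Pass@5 by 26.4%, while maintaining higher strategy diversity throughout training.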

Original Abstract

We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.
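The abstract reports Pass@1 and Pass@5 but does not say how they are computed. As a point of reference only, the sketch below shows the unbiased pass@k estimator commonly used when n solutions are sampled per problem and c of them are correct. Whether the authors use this estimator is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: this is the commonly used unbiased estimator
    1 - C(n - c, k) / C(n, k); the abstract does not confirm that the
    paper computes Pass@k this way.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so any draw of k samples must then contain a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 samples for one problem, 3 of them correct.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

Averaging this value over all problems in a benchmark gives the dataset-level score. With k = 1 the estimator reduces to the fraction of correct samples, c / n.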
