2602.01207v1 Feb 01, 2026 cs.AI

Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models

Shuaiqiang Wang

Dawei Yin

Yuchen Li

Hui Wu

Hengyi Cai

Zhejun Zhao

Ziheng Li

Jinman Zhao

Xinran Chen

Abstract

Preference-based alignment is pivotal for training large reasoning models; however, standard methods like Direct Preference Optimization (DPO) typically treat all preference pairs uniformly, overlooking the evolving utility of training instances. This static approach often leads to inefficient or unstable optimization, as it wastes computation on trivial pairs with negligible gradients and suffers from noise induced by samples near uncertain decision boundaries. Facing these challenges, we propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. Concretely, SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines, highlighting the critical role of policy-aware, stability-conscious data selection in reasoning alignment.
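The abstract's two observations about DPO can be made concrete with a small sketch. For a preference pair, the DPO gradient is scaled by sigmoid(-beta * margin), where the margin is the policy-versus-reference log-ratio of the chosen response minus that of the rejected one. A pair the policy already ranks correctly by a wide margin therefore contributes almost no gradient, while a confidently mis-ranked pair contributes a large one. The selection rule below is only an illustration under stated assumptions: the abstract does not give SAGE's scoring function, so the stability proxy (standard deviation of a pair's recent margins), the names `select_pairs`, `margin_history`, `max_std`, and `top_k`, and all thresholds are hypothetical. The coarse-grained curriculum that refreshes the candidate pool is not modeled here.

```python
import math
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_gradient_weight(margin: float, beta: float = 0.1) -> float:
    """Per-pair scale of the DPO gradient: sigmoid(-beta * margin).

    margin = [log pi(y_w|x) - log pi_ref(y_w|x)]
           - [log pi(y_l|x) - log pi_ref(y_l|x)]
    A large positive margin (pair already ranked correctly) gives a weight
    near 0; a large negative margin (confident error) gives a weight near 1.
    """
    return sigmoid(-beta * margin)


def select_pairs(margin_history, beta=0.1, max_std=1.0, top_k=2):
    """Hypothetical selection rule, NOT the paper's scoring function.

    margin_history maps a pair id to that pair's margins over recent
    evaluations. Pairs whose margins fluctuate by more than max_std are
    treated as unstable and dropped; the rest are ranked by the DPO
    gradient weight at their mean margin, and the top_k ids are returned.
    """
    scored = []
    for pair_id, margins in margin_history.items():
        if pstdev(margins) > max_std:
            continue  # assumed stability filter: skip fluctuating pairs
        scored.append((dpo_gradient_weight(mean(margins), beta), pair_id))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair_id for _, pair_id in scored[:top_k]]
```

With a history containing one trivially correct pair, one stable confident error, one pair whose margin swings in sign, and one borderline pair, this rule would keep the confident error first and discard both the unstable pair and the trivial one.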
