2605.07353v1 May 08, 2026 cs.AI

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Jiawen Zhang (Citations: 24, h-index: 4)
Kejia Chen (Citations: 126, h-index: 4)
Jian Lou (Citations: 21, h-index: 3)
Zunlei Feng (Citations: 7, h-index: 2)
Ruoxi Jia (Citations: 160, h-index: 8)
Yihong Wu (Citations: 48, h-index: 3)
Kewei Gao (Citations: 8, h-index: 2)
Min-Gyoo Song (Citations: 423, h-index: 11)

Original Abstract

Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME'24 and AIME'25 without using reward-model data. We also release a step-wise dataset with confidence annotations to support fine-grained analysis of reasoning reliability. Code is available at https://github.com/Thecommonirin/CASPO.
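The pruning idea behind CaT can be illustrated with a minimal Python sketch. The abstract does not say how confidence is computed or thresholded, so everything below is an assumption: confidence is taken as the maximum softmax probability of each token (a single pass over the V vocabulary logits, in line with the O(V) cost quoted above), a step's score is the mean over its tokens, and the names `token_confidence`, `step_confidence`, `prune_branches`, and the 0.5 threshold are hypothetical. This is not the authors' implementation; the linked repository is the reference for that.

```python
import math
from typing import List, Sequence


def token_confidence(logits: Sequence[float]) -> float:
    """Maximum softmax probability for one token's logits (one O(V) pass).

    Assumption: the paper may define confidence differently.
    """
    m = max(logits)
    # Subtract the max for numerical stability; the largest term becomes exp(0) = 1.
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def step_confidence(step_logits: Sequence[Sequence[float]]) -> float:
    """Mean token confidence over the tokens of one reasoning step (assumed aggregation)."""
    return sum(token_confidence(l) for l in step_logits) / len(step_logits)


def prune_branches(
    branches: Sequence[Sequence[Sequence[float]]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[int]:
    """Return indices of branches whose latest step meets the confidence threshold."""
    return [i for i, b in enumerate(branches) if step_confidence(b) >= threshold]


if __name__ == "__main__":
    confident = [[5.0, 0.0, 0.0]]   # one token, sharply peaked distribution
    uncertain = [[0.0, 0.0, 0.0]]   # one token, uniform distribution
    print(prune_branches([confident, uncertain]))
```

In this toy setup the uniform-logit branch scores 1/3 and is dropped, while the peaked branch is kept. A real decoder would read logits from the model at each step rather than from hand-written lists.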
