2604.01170v1 Apr 01, 2026 cs.LG

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Cai Zhou

Citations: 76

h-index: 6

Stephen Bates

Citations: 28

h-index: 3

T. Jaakkola

Citations: 56,222

h-index: 110

Q. Zhu

Citations: 1

h-index: 1

Zekai Wang

Citations: 7

h-index: 1

Menghua Wu

Citations: 1

h-index: 1

Flora C. Shi

Citations: 2

h-index: 1

Chenyu Wang

Citations: 13

h-index: 2

Ashia Wilson

Citations: 15

h-index: 2

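Test-time scaling has enabled large language models to solve highly difficult problems, but state-of-the-art results come at enormous compute cost. This inefficiency stems from the miscalibration of post-trained language models and from the lack of calibration in popular sampling techniques. This paper presents Online Reasoning Calibration (ORCA), a framework that calibrates the sampling process using conformal prediction and test-time training. Specifically, it introduces a meta-learning procedure that updates the calibration module for each input. This yields valid confidence estimates under distribution shift, such as shifts in thought patterns across different stages of reasoning or in prompt distributions between model development and deployment. ORCA provides theoretical guarantees on conformal risk and empirically shows higher efficiency and better generalization across a range of reasoning tasks. At risk level $δ=0.1$, ORCA improves the efficiency of Qwen2.5-32B on in-distribution tasks, with savings of up to 47.5% with supervised labels and up to 40.7% with self-consistency labels. In the zero-shot out-of-domain setting, ORCA raises compute savings on MATH-500 from 24.8% for the static calibration baseline to 67.0% while maintaining a low empirical error rate. The same trend holds across model families and downstream benchmarks. The code is publicly available at https://github.com/wzekai99/ORCA.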

Original Abstract

While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $δ=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
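To make the idea of calibrating a sampling process at a risk level $δ$ concrete, here is a minimal sketch of a static conformal-risk-control threshold, the kind of fixed calibration the abstract contrasts ORCA against. It is not ORCA's method: the per-input test-time update of the calibration module is not shown. The function names `calibrate_threshold` and `first_stop_step`, the use of a scalar confidence score per reasoning step, and the 0/1 loss ("stopped early on a wrong answer") are all assumptions made for illustration.

```python
import numpy as np


def calibrate_threshold(scores, correct, delta):
    """Pick the smallest stopping threshold whose adjusted risk is <= delta.

    Hypothetical setup: scores[i] is a confidence score for calibration
    example i, and correct[i] says whether stopping there gives a right answer.
    The loss is 1 when we would stop (score >= lam) on a wrong answer, which is
    non-increasing in lam, as conformal risk control requires.
    """
    scores = np.asarray(scores, dtype=float)
    wrong = ~np.asarray(correct, dtype=bool)
    n = len(scores)
    # Candidates in ascending order; +inf means "never stop early".
    candidates = np.append(np.unique(scores), np.inf)
    for lam in candidates:
        n_bad = int(np.sum((scores >= lam) & wrong))
        # Finite-sample adjusted risk for a loss bounded by 1.
        if (n_bad + 1) / (n + 1) <= delta:
            return float(lam)
    return float("inf")


def first_stop_step(step_scores, lam):
    """Return the index of the first step whose score reaches lam, else None."""
    for i, s in enumerate(step_scores):
        if s >= lam:
            return i
    return None
```

A lower threshold stops sampling earlier and saves more compute, so the sketch returns the smallest threshold that still satisfies the risk bound. With too few calibration examples no threshold can satisfy it, and the function falls back to never stopping early.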

0 Citations

0 Influential

55.493061443341 Altmetric

277.5 Score
