2601.21590v1 Jan 29, 2026 cs.LG

Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

Haitham Bou-Ammar (Citations: 2,591; h-index: 30)

Rasul Tutunov (Citations: 781; h-index: 14)

Matthieu Zimmer (Citations: 70; h-index: 5)

Xiaotong Ji (Citations: 39; h-index: 4)

Original Abstract

Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
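
The relationship between the global power distribution and a token-level scaled low-temperature distribution can be illustrated on a toy model. The sketch below is not the paper's algorithm: the two-token vocabulary, the probabilities, and all function names are hypothetical, and the future factor is computed by exhaustive enumeration, which is only feasible for tiny examples. It relies on the standard identity that the first-token marginal of p(x)^alpha is proportional to p(x1)^alpha multiplied by the sum over continuations of p(x2 | x1)^alpha.

```python
import itertools
import math

# Hypothetical toy "language model": vocabulary {a, b}, sequences of length 2.
# All probabilities are made up for illustration.
P_FIRST = {"a": 0.6, "b": 0.4}
P_SECOND = {
    "a": {"a": 0.5, "b": 0.5},  # after "a" the model is uncertain
    "b": {"a": 0.9, "b": 0.1},  # after "b" the model is confident
}
VOCAB = "ab"


def normalise(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def sequence_prob(seq):
    """Base-model probability of a full two-token sequence."""
    first, second = seq
    return P_FIRST[first] * P_SECOND[first][second]


def power_first_token_marginal(alpha):
    """First-token marginal of the global power distribution p(x)^alpha."""
    seqs = itertools.product(VOCAB, repeat=2)
    dist = normalise({s: sequence_prob(s) ** alpha for s in seqs})
    return {t: sum(p for s, p in dist.items() if s[0] == t) for t in VOCAB}


def low_temperature_first_token(alpha):
    """Plain low-temperature sampling: p(x1)^alpha, ignoring the future."""
    return normalise({t: P_FIRST[t] ** alpha for t in VOCAB})


def scaled_low_temperature_first_token(alpha):
    """Low-temperature weight times a factor summarising future trajectories."""
    weights = {}
    for t in VOCAB:
        future_factor = sum(p ** alpha for p in P_SECOND[t].values())
        weights[t] = P_FIRST[t] ** alpha * future_factor
    return normalise(weights)


if __name__ == "__main__":
    alpha = 4.0
    print("power marginal:        ", power_first_token_marginal(alpha))
    print("low temperature only:  ", low_temperature_first_token(alpha))
    print("scaled low temperature:", scaled_low_temperature_first_token(alpha))
```

In this toy setting the scaled low-temperature weights reproduce the power-distribution marginal, whereas plain low-temperature sampling strongly favours the token "a" even though its continuations are less certain. For a real LLM the sum over continuations cannot be enumerated and would have to be approximated.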

14 Citations

3 Influential

15 Altmetric

95.0 Score
