2604.16535v2 Apr 16, 2026 cs.LG

SCATR: Simple Calibrated Test-Time Ranking

C. Ekbote

Citations: 327

h-index: 6

Divya Shyamal

Citations: 2

h-index: 1

Marta Knežević

Citations: 0

h-index: 0

Lan Tran

Citations: 3

h-index: 1

Vijay Lingam

Citations: 68

h-index: 3

P. Liang

Citations: 64

h-index: 4

Abstract

Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000x fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150x and 1000x, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000x faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.
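The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say what form the lightweight scorer takes, so the sketch assumes a logistic-regression probe over one fixed-size feature vector per response (standing in for a pooled hidden representation from the base model). All function names (`mean_logprob_score`, `train_linear_scorer`, `linear_score`, `best_of_n`) and the toy calibration data are hypothetical. The mean token log-probability function illustrates the kind of confidence heuristic the abstract compares against.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical names throughout; the scorer's real architecture is not given
# in the abstract. A logistic-regression probe is assumed for illustration.


def mean_logprob_score(token_logprobs: Sequence[float]) -> float:
    """Confidence heuristic: mean token log-probability of one response."""
    return sum(token_logprobs) / len(token_logprobs)


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_scorer(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent.

    features: one vector per calibration response (assumed pooled hidden state).
    labels:   1 if the response was correct, 0 otherwise.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def linear_score(
    weights: Sequence[float], bias: float, feature: Sequence[float]
) -> float:
    """Score one response; a higher logit means 'more likely correct'."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def best_of_n(candidates: Sequence, score: Callable) -> int:
    """Best-of-N selection: index of the highest-scoring candidate."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


if __name__ == "__main__":
    # Toy calibration set: 2-d stand-ins for hidden representations.
    calib_x = [[1.0, 0.2], [2.0, -0.1], [-1.0, 0.3], [-2.0, 0.0]]
    calib_y = [1, 1, 0, 0]
    w, b = train_linear_scorer(calib_x, calib_y)

    # Rank three candidate responses for a new prompt with the learned scorer.
    candidates = [[-0.5, 0.1], [1.5, 0.0], [0.2, -0.2]]
    print(best_of_n(candidates, lambda h: linear_score(w, b, h)))
```

Under these assumptions the only trainable parameters are one weight per feature dimension plus a bias, and scoring a candidate costs a single dot product over features that generation already produces. This is one plausible reading of how a scorer of this kind could stay far cheaper than LoRA fine-tuning or running a separate PRM.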
