2604.26644v1 Apr 29, 2026 cs.AI

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

Zhimin Lin (citations: 12; h-index: 2)

Jinpeng Li (citations: 0; h-index: 0)

Yu-Mei Luo (citations: 55; h-index: 2)

Dong Li (citations: 134; h-index: 4)

Juntao Li (citations: 2,528; h-index: 24)

Min Zhang (citations: 203; h-index: 10)

Yixin Ji (citations: 200; h-index: 10)

Junhua Fang (citations: 0; h-index: 0)

Original Abstract

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem: rather than allocating more computation within a single strategy, it dynamically selects among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% to 7% while reducing sampling cost compared to existing approaches.
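
The three-way routing described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the disagreement measure (the fraction of samples that differ from the most common answer), the thresholds `low` and `high`, the strategy labels, and the hypothetical `sample` and `rewrite` callables that stand in for model calls. In particular, the abstract does not say what "lightweight resolution" involves, so the sketch simply accepts the agreed answer in that case.

```python
from collections import Counter
from typing import Callable, List, Sequence


def disagreement(answers: Sequence[str]) -> float:
    """Fraction of samples that differ from the most common answer.

    0.0 means all samples agree. This measure is an assumption of the
    sketch, not necessarily the one used in the paper.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def route(answers: Sequence[str], low: float = 0.2, high: float = 0.6) -> str:
    """Pick a strategy label from the disagreement score.

    The thresholds `low` and `high` are illustrative placeholders.
    """
    d = disagreement(answers)
    if d <= low:
        return "accept"   # consistent outputs: lightweight resolution
    if d <= high:
        return "vote"     # moderate disagreement: majority voting
    return "rewrite"      # high disagreement: reformulate the question


def solve(
    question: str,
    sample: Callable[[str, int], List[str]],  # hypothetical: returns n final answers
    rewrite: Callable[[str], str],            # hypothetical: returns a restated question
    n: int = 8,
    low: float = 0.2,
    high: float = 0.6,
) -> str:
    """Sample answers, route on their disagreement, and return one answer."""
    answers = sample(question, n)
    if route(answers, low, high) == "rewrite":
        # Assumption: resample on the rewritten question, then vote.
        answers = sample(rewrite(question), n)
    # For "accept" and "vote" the most common answer is returned directly.
    return Counter(answers).most_common(1)[0][0]
```

With these placeholder thresholds, four identical answers route to `"accept"`, while four mutually different answers route to `"rewrite"`. In practice the thresholds and the disagreement measure would need tuning per model and benchmark.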

Citations: 0; Influential citations: 0; Altmetric: 12; Score: 60.0
