2602.02143v1 Feb 02, 2026 cs.LG

Learning Generative Selection for Best-of-N

Somshubra Majumdar

Citations: 4,100

h-index: 24

V. Noroozi

Citations: 2,419

h-index: 19

Aleksander Ficek

Citations: 501

h-index: 10

Siddhartha Jain

Citations: 502

h-index: 6

Igor Gitman

Citations: 2,490

h-index: 15

Wei Du

Citations: 520

h-index: 7

Shubham Toshniwal

Citations: 4,771

h-index: 24

Sadegh Mahdavi

University of British Columbia

Citations: 174

h-index: 7

Abstract

Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.
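The data-synthesis and reward steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` record, the helper names, the prompt layout, and the binary 0/1 reward are all assumptions made for the example, and the DAPO training loop itself is not shown. A simple majority-vote baseline is included for comparison.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str         # full candidate solution (hypothetical field)
    answer: str       # extracted final answer (hypothetical field)
    is_correct: bool  # label from some verifier, assumed to be available


def is_mixed(candidates):
    """Keep only instances with both correct and incorrect candidates."""
    return {c.is_correct for c in candidates} == {True, False}


def build_selection_task(problem, candidates, n, rng):
    """Build one selection prompt with n >= 2 candidates of mixed correctness.

    Returns the prompt text and the set of slot indices holding correct solutions.
    """
    correct = [i for i, c in enumerate(candidates) if c.is_correct]
    wrong = [i for i, c in enumerate(candidates) if not c.is_correct]
    # Guarantee one correct and one incorrect candidate, then fill at random.
    picked = [rng.choice(correct), rng.choice(wrong)]
    rest = [i for i in range(len(candidates)) if i not in picked]
    picked += rng.sample(rest, min(n - 2, len(rest)))
    rng.shuffle(picked)
    chosen = [candidates[i] for i in picked]
    body = "\n\n".join(f"Solution {k}:\n{c.text}" for k, c in enumerate(chosen))
    correct_slots = {k for k, c in enumerate(chosen) if c.is_correct}
    return f"{problem}\n\n{body}", correct_slots


def selection_reward(selected_index, correct_slots):
    """Assumed binary reward: 1.0 if the selector picked a correct candidate."""
    return 1.0 if selected_index in correct_slots else 0.0


def majority_vote(candidates):
    """Baseline: return the most frequent final answer among the candidates."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


if __name__ == "__main__":
    pool = [
        Candidate("reasoning A", "42", True),
        Candidate("reasoning B", "41", False),
        Candidate("reasoning C", "41", False),
    ]
    if is_mixed(pool):
        prompt, slots = build_selection_task("What is 6 * 7?", pool, 3, random.Random(0))
        print(prompt)
        print("correct slots:", slots)
    print("majority vote picks:", majority_vote(pool))
```

In the toy pool above, majority voting returns the wrong answer because the incorrect answer is more frequent, whereas a selector that reads the candidates can still be rewarded for picking the single correct one. This is the kind of case where a learned selector has room to beat the baseline.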
