2602.14917v1 Feb 16, 2026 cs.CL

BFS-PO: Best-First Search for Large Reasoning Models

Fiorenzo Parascandolo (Citations: 5, h-index: 1)

Wenhui Tan (Citations: 188, h-index: 7)

E. Sangineto (Citations: 4,441, h-index: 29)

Ruihua Song (Citations: 126, h-index: 5)

Rita Cucchiara (Citations: 1, h-index: 1)

Original Abstract

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
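
The abstract only names the ingredients (best-first search, backtracking at maximum-entropy nodes, a preference for the shortest correct answer), so the following is a rough illustration of how those pieces could fit together, not the authors' implementation. Everything here is an assumption made for the sketch: the function names, the `rollout` and `is_correct` callbacks, the use of chain length as the search priority, and the choice to resample from the single highest-entropy step. The RL policy update itself is left out entirely.

```python
import heapq
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: given a prefix, return the sampled continuation and
# one next-token probability distribution per continuation token.
Rollout = Callable[[List[str]], Tuple[List[str], List[List[float]]]]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_position(step_probs: Sequence[Sequence[float]]) -> int:
    """Index of the step with the most uncertain distribution (must be non-empty)."""
    entropies = [token_entropy(p) for p in step_probs]
    return max(range(len(entropies)), key=entropies.__getitem__)


def best_first_shortest(
    rollout: Rollout,
    is_correct: Callable[[List[str]], bool],
    budget: int = 8,
) -> Optional[List[str]]:
    """Toy best-first loop: always expand the prefix that came from the
    shortest correct chain, backtracking to that chain's max-entropy step."""
    best: Optional[List[str]] = None
    # Heap entries: (priority, unique tie-breaker, prefix to continue from).
    frontier: List[Tuple[int, int, List[str]]] = [(0, 0, [])]
    pushed = 1
    for _ in range(budget):
        if not frontier:
            break
        _, _, prefix = heapq.heappop(frontier)
        continuation, step_probs = rollout(prefix)
        chain = prefix + continuation
        # Assumption: incorrect (or empty) rollouts are simply not expanded.
        if not continuation or not is_correct(chain):
            continue
        if best is None or len(chain) < len(best):
            best = chain
        # Backtrack: keep everything before the most uncertain generated step.
        cut = len(prefix) + max_entropy_position(step_probs)
        heapq.heappush(frontier, (len(chain), pushed, chain[:cut]))
        pushed += 1
    return best
```

In a real setting `rollout` would wrap sampling from the LRM and `is_correct` would wrap an answer verifier; both are placeholders here. The `budget` argument caps the number of rollouts so the loop terminates even when backtracking keeps returning the same prefix.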
