2601.07238v1 Jan 12, 2026 cs.AI

Group Pattern Selection Optimization: Let LRMs Pick the Right Pattern for Reasoning

Hanbin Wang (Citations: 3, h-index: 1)
Jingwei Song (Citations: 12, h-index: 1)
Jinpeng Li (Citations: 3, h-index: 1)
Fei Mi (Citations: 324, h-index: 11)
Lifeng Shang (Citations: 118, h-index: 5)

Abstract

Large reasoning models (LRMs) exhibit diverse high-level reasoning patterns (e.g., direct solution, reflection-and-verification, and exploring multiple solutions), yet prevailing training recipes implicitly bias models toward a limited set of dominant patterns. Through a systematic analysis, we identify substantial accuracy variance across these patterns on mathematics and science benchmarks, revealing that a model's default reasoning pattern is often sub-optimal for a given problem. To address this, we introduce Group Pattern Selection Optimization (GPSO), a reinforcement learning framework that extends GRPO with three components: multi-pattern rollouts, verifier-guided selection of the optimal pattern for each problem, and attention masking during optimization to prevent explicit pattern suffixes from leaking into the learned policy. By exploring a portfolio of diverse reasoning strategies and optimizing the policy on the most effective ones, GPSO enables the model to internalize the mapping from problem characteristics to optimal reasoning patterns. Extensive experiments demonstrate that GPSO delivers consistent and substantial performance gains across various model backbones and benchmarks, effectively mitigating pattern sub-optimality and fostering more robust, adaptable reasoning. All data and code are available at https://github.com/wanghanbinpanda/GPSO.
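The three components named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not specify how patterns are scored, how advantages are normalized, or how the mask is built, so the mean-reward selection rule, the standard GRPO-style normalization, the exact-match verifier, and all names below (`Rollout`, `verify`, `select_best_pattern`, `group_relative_advantages`, `suffix_mask`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Sequence


@dataclass
class Rollout:
    pattern: str    # hypothetical pattern name, e.g. "direct" or "reflect"
    response: str   # final answer extracted from the sampled response
    reward: float   # verifier score, filled in by verify()


def verify(response: str, reference: str) -> float:
    """Assumed verifier: exact match on the final answer (1.0 or 0.0)."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def select_best_pattern(groups: Dict[str, List[Rollout]]) -> str:
    """Assumed selection rule: highest mean verifier reward per problem.
    Ties resolve to the first pattern in insertion order."""
    return max(groups, key=lambda p: mean(r.reward for r in groups[p]))


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardized within the rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def suffix_mask(n_prompt: int, n_suffix: int, n_response: int) -> List[int]:
    """Assumed mask layout: 1 = token visible during optimization,
    0 = pattern-suffix token hidden so the policy cannot rely on it."""
    return [1] * n_prompt + [0] * n_suffix + [1] * n_response


# Toy data: pre-sampled answers stand in for model rollouts.
reference = "42"
sampled = {
    "direct": ["41", "40", "42"],
    "reflect": ["42", "42", "17"],
}
groups = {
    pattern: [Rollout(pattern, ans, verify(ans, reference)) for ans in answers]
    for pattern, answers in sampled.items()
}

best = select_best_pattern(groups)
advantages = group_relative_advantages([r.reward for r in groups[best]])
mask = suffix_mask(n_prompt=3, n_suffix=2, n_response=4)
```

In this toy run the "reflect" group has the higher mean reward, so the policy update would use only its rollouts, with the two suffix positions masked out. A real training loop would sample the rollouts from the policy and feed the advantages and mask into the policy-gradient loss.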

