2602.10048v1 Feb 10, 2026 cs.LG

Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Lu Yin (Citations: 23, h-index: 3)

Xinchen Han (Citations: 28, h-index: 3)

Hossam Afifi (Citations: 66, h-index: 4)

M. Marot (Citations: 452, h-index: 12)

Xilu Wang (Citations: 1,526, h-index: 13)

Abstract

Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational cost and latency without proportional performance gains. In this paper, we propose **F**ine-grained **G**roup policy **O**ptimization (**FGO**), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. As an enhanced variant of Group Relative Policy Optimization (GRPO), FGO also addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance while resolving these two limitations of GRPO.
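For context on the "inefficient data utilization" limitation, the following is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each response's reward is standardized against the other responses sampled for the same prompt. It is not the FGO algorithm itself, whose subdivision and length/entropy weighting are not specified in the abstract. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Illustrative GRPO-style baseline only; `eps` and the choice of
    population standard deviation are assumptions, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes gives positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every response earns the same reward gives all-zero
# advantages, so that group contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When every response in a group receives the same reward, the advantages vanish and the samples are effectively wasted, which is one plausible reading of the data-utilization issue the abstract mentions.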

1 Citation

0 Influential

6.5 Altmetric

33.5 Score
