2602.06822v1 Feb 06, 2026 cs.AI

POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models

Yi Chen (Citations: 349, h-index: 9)

Wonjin Shin (Citations: 1, h-index: 1)

Tho Mai (Citations: 2, h-index: 1)

Jeongmo Lee (Citations: 5, h-index: 1)

Chuanbo Hua (Citations: 8, h-index: 1)

Kunlun Wang (Citations: 2, h-index: 1)

Jooyul Kim (Citations: 220, h-index: 5)

Jun Liu (Citations: 64, h-index: 5)

Shuhong Liu (Citations: 346, h-index: 8)

Original Abstract

Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge during autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions: prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing, including offline calibration, retraining, or learning predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring smaller computational overhead and minimizing inference latency.
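
The two-stage idea in the abstract (a coarse partition fixed at prefill, then a per-token mask computed only over the candidate region) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how channels are scored or how the region sizes are chosen, so the per-channel scores, the `retain_frac` and `prune_frac` parameters, the top-k selection rule, and the function names `coarse_partition` and `fine_mask` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def coarse_partition(
    prefill_scores: Sequence[float],
    retain_frac: float,
    prune_frac: float,
) -> Tuple[List[int], List[int], List[int]]:
    """Split channel indices into (retained, candidate, pruned).

    Assumption: prefill_scores holds one importance score per channel,
    gathered during prefilling; higher means more important.
    """
    n = len(prefill_scores)
    # Rank all channels once, at prefill time only.
    order = sorted(range(n), key=lambda i: prefill_scores[i], reverse=True)
    n_retain = int(n * retain_frac)
    n_prune = int(n * prune_frac)
    retained = sorted(order[:n_retain])
    candidate = sorted(order[n_retain:n - n_prune])
    pruned = sorted(order[n - n_prune:]) if n_prune else []
    return retained, candidate, pruned


def fine_mask(
    token_scores: Sequence[float],
    retained: Sequence[int],
    candidate: Sequence[int],
    keep_in_candidate: int,
) -> List[bool]:
    """Build the per-token channel mask during decoding.

    Retained channels are always on and pruned channels are always off.
    Only the candidate channels are re-ranked with the current token's
    scores, so no full-channel re-evaluation is needed.
    """
    mask = [False] * len(token_scores)
    for i in retained:
        mask[i] = True
    ranked = sorted(candidate, key=lambda i: token_scores[i], reverse=True)
    for i in ranked[:keep_in_candidate]:
        mask[i] = True
    return mask
```

In this sketch the cost of each decoding step scales with the size of the candidate region rather than the full channel count, which is the property the abstract attributes to avoiding full-channel re-evaluation. A channel placed in the pruned region at prefill stays off even if a later token scores it highly, which is a consequence of the simplification made here and may differ from the paper's actual behavior.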
