2602.03295v1 Feb 03, 2026 cs.CL

POP: Prefill-Only Pruning for Efficient Large Model Inference

Junhui He (Citations: 15, h-index: 2)

Zhihui Fu (Citations: 174, h-index: 6)

Jun Wang (Citations: 63, h-index: 4)

Qing'an Li (Citations: 572, h-index: 15)

Original Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
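The stage-aware control flow described in the abstract can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyLayer`, `pop_prefill`, `decode_step`, and `make_layers` are hypothetical names, each "layer" is a scalar stand-in for a transformer block, and two details are guesses about what the abstract means. First, that the skipped deep layers fill their KV cache by projecting the hidden state of the last retained layer. Second, that boundary handling amounts to running the final prompt token through the full stack.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    """Scalar stand-in for a transformer block (hypothetical, for illustration)."""
    wk: float
    wv: float
    wo: float
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    full_calls: int = 0  # how many times the full block was evaluated

    def project_kv(self, h: float) -> None:
        """Append K/V for one token without running attention or the MLP."""
        self.keys.append(self.wk * h)
        self.values.append(self.wv * h)

    def forward(self, h: float) -> float:
        """Full block: cache K/V, attend causally over the cache, add residual."""
        self.full_calls += 1
        self.project_kv(h)
        scores = [h * k for k in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        attn = sum(w * v for w, v in zip(weights, self.values)) / sum(weights)
        return h + self.wo * math.tanh(attn)


def decode_step(layers, h):
    """Decode always uses the full stack."""
    for layer in layers:
        h = layer.forward(h)
    return h


def pop_prefill(layers, prompt, n_keep):
    """Prefill with deep layers skipped; returns the last token's final hidden state."""
    shallow, deep = layers[:n_keep], layers[n_keep:]
    for h in prompt[:-1]:
        for layer in shallow:
            h = layer.forward(h)
        # Assumption: deep layers reuse the last retained hidden state for K/V,
        # so their caches stay populated for the decode stage.
        for layer in deep:
            layer.project_kv(h)
    # Assumption (boundary handling): the final prompt token runs through every
    # layer, so the first generated token comes from the full model.
    return decode_step(layers, prompt[-1])


def make_layers(n):
    """Build n toy layers with fixed, arbitrary weights."""
    return [ToyLayer(wk=0.1 * (i + 1), wv=0.5, wo=0.2) for i in range(n)]


layers = make_layers(6)
prompt = [0.3, -0.1, 0.8, 0.5]  # toy "embeddings", one float per token
first_hidden = pop_prefill(layers, prompt, n_keep=4)
next_hidden = decode_step(layers, first_hidden)  # toy feedback as the next input
```

In this toy setup, a prompt of T tokens over L layers with n retained layers costs (T - 1) * n + L full block evaluations during prefill instead of T * L, while every layer still ends prefill with T cached key-value pairs, so the decode step can attend over a complete cache.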

Citations: 1 · Influential: 0 · Altmetric: 7.5 · Score: 38.5
