2602.13980v1 Feb 15, 2026 cs.AI

Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking

Jiecao Yu (Citations: 16,466; h-index: 10)

Yiqi Wang (Citations: 8; h-index: 1)

Songlei Jian (Citations: 2; h-index: 1)

Jianfeng Zhang (Citations: 5; h-index: 1)

Guojie Liu (Citations: 35; h-index: 3)

Yanfeng Yang (Citations: 37; h-index: 4)

Wenqi Fan (Citations: 174; h-index: 7)

Abstract

Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, because the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression has emerged as a widely studied solution, particularly soft prompt compression, which converts long contexts into shorter memory embeddings via a trained compressor. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, which requires the compressor to capture global dependencies and demands extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and by empirical observations of the spatial specialization of memory embeddings relative to original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer's attention mask, PIC explicitly restricts the receptive field of memory tokens to sequential local chunks, thereby lowering the difficulty of compressor training. Experiments across multiple downstream tasks demonstrate that PIC consistently outperforms competitive baselines, with its advantage particularly pronounced in high-compression scenarios (e.g., relative improvements of 29.8% in F1 score and 40.7% in EM score on QA tasks at the 64× compression ratio). Furthermore, PIC significantly expedites training: when training the 16× compressor, it surpasses the peak performance of the competitive baseline while reducing training time by approximately 40%.
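
The idea of restricting each memory token's receptive field to one local chunk can be illustrated with a small attention mask. The following is only a minimal sketch under stated assumptions, not the mask used in the paper: the token layout (context tokens followed by memory tokens), the plain causal mask over context tokens, and the names `chunked_memory_mask` and `receptive_field` are all hypothetical. The abstract does not say whether memory tokens also attend to earlier memory tokens, so this sketch leaves that out.

```python
def chunked_memory_mask(num_context: int, ratio: int) -> list[list[bool]]:
    """Build a boolean mask; mask[q][k] is True when query q may attend to key k.

    Assumed layout (hypothetical, not taken from the paper): positions
    0..num_context-1 hold context tokens, followed by num_context // ratio
    memory tokens, one per contiguous chunk of `ratio` context tokens.
    """
    if num_context % ratio != 0:
        raise ValueError("num_context must be a multiple of ratio")
    num_memory = num_context // ratio
    total = num_context + num_memory
    mask = [[False] * total for _ in range(total)]

    # Context tokens: ordinary causal attention over earlier context tokens.
    for q in range(num_context):
        for k in range(q + 1):
            mask[q][k] = True

    # Memory token j: only the j-th chunk of context tokens, plus itself.
    for j in range(num_memory):
        q = num_context + j
        for k in range(j * ratio, (j + 1) * ratio):
            mask[q][k] = True
        mask[q][q] = True
    return mask


def receptive_field(mask: list[list[bool]], query: int) -> list[int]:
    """Return the key positions that the given query position may attend to."""
    return [k for k, allowed in enumerate(mask[query]) if allowed]


if __name__ == "__main__":
    # 8 context tokens at a 4x ratio give 2 memory tokens (positions 8 and 9).
    demo = chunked_memory_mask(num_context=8, ratio=4)
    for row in demo:
        print("".join("x" if allowed else "." for allowed in row))
```

With 8 context tokens and a ratio of 4, the memory token at position 8 sees only context positions 0 to 3 and the one at position 9 sees only positions 4 to 7. A compressor that maps the whole context into every memory token would instead set every context column to True in both memory rows.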
