2605.14458v1 May 14, 2026 cs.AI

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

Hyemi Jang

Y. Park

Minseo Choi

Jong-Seok Lee

J. Choi

Y. Jeon

Abstract

Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input level, allowing early layers to preserve sufficient omni-modal information fusion before tokens are aggressively removed in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.
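The abstract names three ingredients (progressive pruning across decoder layers, text-query guidance, and a temporal diversity score) but does not give their formulas. The following is a minimal illustrative sketch, not the authors' implementation. Everything concrete in it is an assumption made for illustration: cosine similarity as the query-relevance score, a greedy per-segment penalty as the diversity term, and the hypothetical names `query_relevance`, `select_tokens`, `progressive_prune`, `segment_ids`, `keep_ratios`, and `diversity_weight`.

```python
import numpy as np


def query_relevance(av_tokens, query_tokens, eps=1e-8):
    """Assumed relevance score: max cosine similarity to any text-query token."""
    a = av_tokens / (np.linalg.norm(av_tokens, axis=1, keepdims=True) + eps)
    q = query_tokens / (np.linalg.norm(query_tokens, axis=1, keepdims=True) + eps)
    return (a @ q.T).max(axis=1)


def select_tokens(relevance, segment_ids, n_keep, diversity_weight=0.5):
    """Greedily keep n_keep tokens, penalizing temporal segments already well covered.

    segment_ids holds a non-negative integer time-segment index per token.
    The penalty form is a stand-in for the paper's temporal diversity score.
    """
    relevance = np.asarray(relevance, dtype=float)
    segment_ids = np.asarray(segment_ids)
    segment_sizes = np.bincount(segment_ids)
    kept_counts = np.zeros(len(segment_sizes))
    available = np.ones(len(relevance), dtype=bool)
    kept = []
    for _ in range(min(n_keep, len(relevance))):
        # Fraction of each token's segment that has already been kept.
        coverage = kept_counts[segment_ids] / segment_sizes[segment_ids]
        score = np.where(available, relevance - diversity_weight * coverage, -np.inf)
        best = int(np.argmax(score))
        kept.append(best)
        available[best] = False
        kept_counts[segment_ids[best]] += 1
    return np.sort(np.array(kept, dtype=int))


def progressive_prune(av_tokens, segment_ids, query_tokens, keep_ratios):
    """Apply one pruning step per layer; keep_ratios[i] = 1.0 means no pruning at layer i."""
    segment_ids = np.asarray(segment_ids)
    alive = np.arange(av_tokens.shape[0])
    for ratio in keep_ratios:
        # A real decoder layer would update the hidden states here;
        # this sketch reuses the input embeddings at every layer.
        n_keep = max(1, int(round(len(alive) * ratio)))
        rel = query_relevance(av_tokens[alive], query_tokens)
        local = select_tokens(rel, segment_ids[alive], n_keep)
        alive = alive[local]
    return alive


rng = np.random.default_rng(0)
av = rng.normal(size=(12, 8))          # 12 audiovisual tokens, dimension 8
seg = np.repeat(np.arange(4), 3)       # 4 time segments, 3 tokens each
query = rng.normal(size=(2, 8))        # 2 text-query tokens
survivors = progressive_prune(av, seg, query, keep_ratios=[1.0, 0.5, 0.5])
print(survivors)
```

Setting the early entries of `keep_ratios` to 1.0 mirrors the stated idea of letting early layers fuse modalities before deeper layers prune aggressively; the actual schedule used by OmniDrop is not given in the abstract.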
