2603.28696v1 Mar 30, 2026 cs.CV

AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Mahdi Rad (citations: 230, h-index: 5)
Kevin Qu (citations: 149, h-index: 3)
Haozhe Qi (citations: 249, h-index: 5)
Marc Pollefeys (citations: 38, h-index: 3)
Rui Wang (citations: 18, h-index: 2)
Alexander Mathis (citations: 70, h-index: 5)

Original Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
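The abstract describes the pipeline only at a high level, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes each video group comes with per-token cross-modal attention scores and a probability distribution over candidate answers. The softmax-over-negative-entropy weighting, the `temperature` and `stop_entropy` parameters, and all function names are hypothetical choices made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split total_budget across groups; lower entropy gets a larger share.

    Softmax over negative entropy is an assumed weighting, not the paper's.
    """
    logits = [-h / temperature for h in entropies]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    norm = sum(weights)
    budgets = [int(total_budget * w / norm) for w in weights]
    # Hand leftover tokens (from rounding down) to the most confident group.
    budgets[entropies.index(min(entropies))] += total_budget - sum(budgets)
    return budgets

def top_tokens(attention_scores, k):
    """Indices of the k highest-attention tokens, returned in temporal order."""
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:k])

def select_tokens(groups, total_budget, stop_entropy=None):
    """groups: list of (attention_scores, answer_probs), one per video group.

    If stop_entropy is set, scanning stops once a group's entropy falls
    below it (a stand-in for the early stopping in AdaptToken-Lite).
    """
    seen = []
    for scores, probs in groups:
        h = entropy(probs)
        seen.append((scores, h))
        if stop_entropy is not None and h < stop_entropy:
            break  # the model is confident enough; skip the remaining groups
    budgets = allocate_budget([h for _, h in seen], total_budget)
    return [top_tokens(scores, min(b, len(scores)))
            for (scores, _), b in zip(seen, budgets)]

# Toy input: three groups of four tokens each, with made-up numbers.
groups = [
    ([0.1, 0.9, 0.3, 0.5], [0.25, 0.25, 0.25, 0.25]),  # uncertain group
    ([0.7, 0.2, 0.8, 0.1], [0.97, 0.01, 0.01, 0.01]),  # confident group
    ([0.4, 0.6, 0.2, 0.3], [0.5, 0.5, 0.0, 0.0]),
]
full = select_tokens(groups, total_budget=6)
lite = select_tokens(groups, total_budget=6, stop_entropy=0.5)
```

In this toy run, `lite` stops after the second group because its answer distribution is sharply peaked, while `full` scores all three groups before allocating the budget. A group's budget is capped at the number of tokens it has, so the total selected can fall below `total_budget`; redistributing that slack is left out to keep the sketch short.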

Citations: 1 · Influential: 0 · Altmetric: 2.5 · Score: 13.5
