2603.28610v2 Mar 30, 2026 cs.CV

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Kun Xu

Citations: 36

h-index: 3

Zhongtao Jiang

Citations: 213

h-index: 6

Shizhu He

Citations: 7,961

h-index: 30

Jun Zhao

Citations: 1,187

h-index: 13

Kang Liu

Citations: 298

h-index: 11

Yupu Hao

Citations: 40

h-index: 5

Huanxuan Liao

Institute of Automation, Chinese Academy of Sciences

Citations: 205

h-index: 6

Yuqiao Tan

Citations: 42

h-index: 4

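Multimodal large language models (MLLMs) improve visual understanding by raising input fidelity, but the resulting growth in visual tokens makes it difficult to sustain high spatial resolution and long temporal context at the same time. This work argues that the bottleneck lies not in how post-encoding representations are compressed but in the number of pixels the encoder receives, and proposes ResAdapt, an input-side adaptation framework. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone keeps its native visual-token interface while receiving a transformed input. Allocation is formulated as a contextual bandit problem, and the Allocator is trained with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. On budget-controlled video question answering, temporal grounding, and image reasoning tasks, ResAdapt improves performance in low-budget settings and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt processes up to 16x more frames at the same visual budget while delivering a performance gain of over 15%. Code is available at https://github.com/Xnhyacinth/ResAdapt.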
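The frame-count claim can be illustrated with simple token arithmetic. The sketch below assumes a patch-based encoder that emits one token per patch, so token count scales with pixel count. The patch size (14), the 448-pixel base resolution, and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
def tokens_per_frame(height: int, width: int, patch: int = 14) -> int:
    """Visual tokens for one frame, assuming one token per patch x patch tile."""
    return (height // patch) * (width // patch)


def frames_within_budget(budget: int, height: int, width: int, patch: int = 14) -> int:
    """Number of frames of the given size that fit in a fixed token budget."""
    return budget // tokens_per_frame(height, width, patch)


full = tokens_per_frame(448, 448)      # 32 * 32 = 1024 tokens
quarter = tokens_per_frame(112, 112)   # 8 * 8 = 64 tokens
budget = 8 * full                      # budget that holds 8 full-resolution frames

print(frames_within_budget(budget, 448, 448))  # 8
print(frames_within_budget(budget, 112, 112))  # 128, i.e. 16x more frames
```

Under these assumptions, shrinking each side of a frame by 4x cuts its token cost by 16x, which is one way a fixed budget could stretch to 16x more frames. An adaptive allocator would presumably apply such reductions only to frames that tolerate them.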

Original Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
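To make the contextual-bandit framing concrete, here is a minimal toy sketch. It is not CAPO and not the authors' Allocator: it is plain REINFORCE with a moving-average baseline over a tabular softmax policy. The two frame types, the three resize factors, the `scale ** 2` cost model, the penalty weight `lam`, and the `toy_rollout` stand-in for the frozen backbone are all hypothetical assumptions made for illustration.

```python
import math
import random

SCALES = [0.25, 0.5, 1.0]  # hypothetical per-frame resize factors (the actions)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost_aware_reward(correct, scale, lam=0.5):
    """Task reward minus a penalty proportional to pixel cost (scale squared)."""
    return (1.0 if correct else 0.0) - lam * scale ** 2


def toy_rollout(context, scale):
    """Toy stand-in for the frozen backbone: context 0 (simple frame) is
    answerable at any scale, context 1 (detailed frame) only at full scale."""
    required = 0.25 if context == 0 else 1.0
    return scale >= required


def train(steps=4000, lr=0.2, lam=0.5, seed=0):
    """REINFORCE with a moving-average baseline on a two-context bandit."""
    rng = random.Random(seed)
    logits = [[0.0] * len(SCALES) for _ in range(2)]
    baseline = [0.0, 0.0]
    for _ in range(steps):
        ctx = rng.randrange(2)                      # observe a context
        probs = softmax(logits[ctx])
        action = rng.choices(range(len(SCALES)), weights=probs)[0]
        scale = SCALES[action]
        # Only the reward of the chosen action is observed (bandit feedback).
        reward = cost_aware_reward(toy_rollout(ctx, scale), scale, lam)
        advantage = reward - baseline[ctx]
        baseline[ctx] += 0.05 * (reward - baseline[ctx])
        # Gradient of log softmax: one-hot(action) minus probs.
        for a in range(len(SCALES)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[ctx][a] += lr * advantage * grad
    return [softmax(row) for row in logits]


if __name__ == "__main__":
    for ctx, probs in enumerate(train()):
        print(ctx, [round(p, 3) for p in probs])
```

In this toy setting the reward favors the smallest scale that still yields a correct answer, so the policy is pushed toward cheap inputs for simple frames and full resolution for detailed ones. How CAPO stabilizes this kind of signal is described in the paper itself, not here.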

1 Citation

0 Influential

45.986122886681 Altmetric

230.9 Score
