2604.11136v1 Apr 13, 2026 cs.CV

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Zekun Qian

Citations: 94

h-index: 5

Ruize Han

Citations: 1,186

h-index: 16

Wei Feng

Citations: 39

h-index: 4

Abstract

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This cuts the text token count by 87-93% in practice. It also preserves full temporal resolution: the trajectory trails encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.
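
The abstract does not say how the visual prompts are drawn, so the following is only a minimal sketch of the general idea, assuming a frame stored as a nested list of RGB tuples and per-object tracks given as lists of boxes. All names here (`PALETTE`, `render_prompt`, the box format, the legend format) are hypothetical and are not taken from the paper. For each object it draws a trail through the centers of its recent boxes, outlines its current box in the same color, and returns a short color-to-object legend as the only text.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), inclusive pixel corners
Frame = List[List[Color]]         # frame[y][x] -> RGB tuple

# Hypothetical palette: the abstract does not specify which colors are used.
PALETTE: List[Tuple[str, Color]] = [
    ("red", (255, 0, 0)),
    ("green", (0, 255, 0)),
    ("blue", (0, 0, 255)),
]


def blank_frame(width: int, height: int, fill: Color = (0, 0, 0)) -> Frame:
    """Create a width x height frame filled with one color."""
    return [[fill for _ in range(width)] for _ in range(height)]


def put(frame: Frame, x: int, y: int, color: Color) -> None:
    """Set one pixel, silently ignoring out-of-bounds coordinates."""
    if 0 <= y < len(frame) and 0 <= x < len(frame[0]):
        frame[y][x] = color


def draw_box(frame: Frame, box: Box, color: Color) -> None:
    """Draw a one-pixel rectangle outline."""
    x0, y0, x1, y1 = box
    for x in range(x0, x1 + 1):
        put(frame, x, y0, color)
        put(frame, x, y1, color)
    for y in range(y0, y1 + 1):
        put(frame, x0, y, color)
        put(frame, x1, y, color)


def draw_line(frame: Frame, p: Tuple[int, int], q: Tuple[int, int],
              color: Color) -> None:
    """Draw a line segment from p to q with Bresenham's algorithm."""
    (x0, y0), (x1, y1) = p, q
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        put(frame, x0, y0, color)
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def center(box: Box) -> Tuple[int, int]:
    """Integer center of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def render_prompt(frame: Frame, tracks: Dict[str, List[Box]]) -> str:
    """Draw boxes and trails in place; return the color-to-object legend.

    tracks maps an object label to its boxes, oldest first; the last box is
    the object's position in this keyframe (an assumed input format).
    """
    legend = []
    for i, (label, boxes) in enumerate(tracks.items()):
        name, color = PALETTE[i % len(PALETTE)]
        centers = [center(b) for b in boxes]
        # Trail: one segment per inter-frame step, so longer segments mean
        # faster motion and the segment direction shows the motion direction.
        for p, q in zip(centers, centers[1:]):
            draw_line(frame, p, q, color)
        draw_box(frame, boxes[-1], color)
        legend.append(f"{name}: {label}")
    return "; ".join(legend)
```

In this sketch the text side carries only one short legend entry per object, regardless of how many frames the trail spans, which is the intuition behind the token saving the abstract reports. A real pipeline would more likely draw with an image library and pick line widths and colors that survive the vision encoder's resizing; those choices are outside what the abstract describes.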
