2603.28069v1 Mar 30, 2026 cs.CV

MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Yue Yang (Citations: 13, h-index: 2)
Ranjay Krishna (Citations: 853, h-index: 18)
Christopher Clark (Citations: 44, h-index: 1)
Jieyu Zhang (Citations: 59, h-index: 2)
Zixian Ma (Citations: 772, h-index: 13)
J. Park (Citations: 632, h-index: 4)
Mohammadreza Salehi (Citations: 651, h-index: 6)
Rohun Tripathi (Citations: 655, h-index: 7)
Sangho Lee (Citations: 153, h-index: 2)
Winson Han (Citations: 52, h-index: 2)
Taira Anderson (Citations: 2, h-index: 1)

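Grounding has become a core capability of vision-language models (VLMs). Most existing VLMs point by emitting coordinates as part of their text output, which forces the model to learn a complicated coordinate system and increases the token count. This work proposes a more intuitive pointing mechanism: directly selecting the visual tokens that contain the target concept. The model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To improve precision, a second special token then selects a finer-grained subpatch within the selected region, and a third token specifies a location within that subpatch. The authors also show that performance improves when points are generated sequentially in a consistent order, the relative position of the previously selected point is encoded, and a special no-more-points class is included when selecting visual tokens. The method sets a new state of the art on image pointing (70.7% on PointBench), sets a new state of the art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improves video pointing (59.1% human preference win rate against a text-coordinate baseline) and tracking (+6.3% on Molmo2Track). The method is also shown to be much more sample-efficient, and the paper discusses the qualitative differences that result from this design change.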

Original Abstract

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
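The three-stage selection described in the abstract (visual token, then subpatch, then location, with a no-more-points option at the first stage) can be illustrated with a small sketch. This is not the authors' implementation. Plain dot-product scoring stands in for cross-attention, and every name and size here (`select`, `decode_point`, `point`, the 14-pixel patch, the 2x2 subpatch and location grids, the per-patch `sub_keys` and shared `loc_keys`) is a hypothetical choice made only for illustration.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select(query: Vector, keys: List[Vector],
           stop_key: Optional[Vector] = None) -> Optional[int]:
    """Return the index of the highest-scoring key.

    If a stop key is given and it scores strictly higher than every
    candidate, return None (the assumed "no more points" outcome).
    """
    scores = [dot(query, k) for k in keys]
    best = max(range(len(scores)), key=scores.__getitem__)
    if stop_key is not None and dot(query, stop_key) > scores[best]:
        return None
    return best


def decode_point(patch_idx: int, sub_idx: int, loc_idx: int, grid_w: int,
                 patch_px: float = 14.0, sub_grid: int = 2,
                 loc_grid: int = 2) -> Tuple[float, float]:
    """Map (patch, subpatch, location) indices to the pixel center of the
    selected location cell. All grid sizes are assumptions."""
    patch_row, patch_col = divmod(patch_idx, grid_w)
    sub_row, sub_col = divmod(sub_idx, sub_grid)
    loc_row, loc_col = divmod(loc_idx, loc_grid)
    sub_px = patch_px / sub_grid
    loc_px = sub_px / loc_grid
    x = patch_col * patch_px + sub_col * sub_px + (loc_col + 0.5) * loc_px
    y = patch_row * patch_px + sub_row * sub_px + (loc_row + 0.5) * loc_px
    return x, y


def point(patch_q: Vector, sub_q: Vector, loc_q: Vector,
          patch_keys: List[Vector], stop_key: Vector,
          sub_keys: List[List[Vector]], loc_keys: List[Vector],
          grid_w: int) -> Optional[Tuple[float, float]]:
    """Coarse-to-fine pointing: pick a patch, then a subpatch inside it,
    then a location inside that subpatch. Returns None when the stop key
    wins at the patch stage."""
    patch_idx = select(patch_q, patch_keys, stop_key)
    if patch_idx is None:
        return None
    sub_idx = select(sub_q, sub_keys[patch_idx])
    loc_idx = select(loc_q, loc_keys)
    return decode_point(patch_idx, sub_idx, loc_idx, grid_w)
```

Under these assumptions each point costs three selection steps regardless of image resolution, whereas a text-coordinate output spends tokens on the digits of each coordinate. Emitting several points would amount to calling `point` repeatedly with fresh queries until it returns `None`.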

Citations: 1, Influential: 0, Altmetric: 9, Score: 46.0
