2603.18856v1 Mar 19, 2026 cs.CV

Motion-o: Trajectory-Grounded Video Reasoning

Bishoy M. Galoaa

Citations: 21

h-index: 3

Shayda Moezzi

Citations: 3

h-index: 1

Xiangyu Bai

Citations: 41

h-index: 3

Sarah Ostadabbas

Citations: 17

h-index: 3

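Recent work has made considerable progress in video reasoning, with many models using spatio-temporal evidence chains to strengthen their reasoning. At the same time, a growing number of datasets and benchmarks provide structured annotations designed to support and evaluate such reasoning. However, reasoning about how objects move between observations has been comparatively overlooked. Prior work has not linked successive observations to describe motion patterns explicitly, so trajectory understanding remains implicit and hard to verify. This work formalizes the gap as Spatial-Temporal-Trajectory (STT) reasoning and proposes **Motion-o**, a video understanding extension for vision-language models that makes trajectories explicit and verifiable. To enable trajectory reasoning, it introduces a dataset construction scheme that augments sparse keyframe supervision into denser bounding box tracks and strengthens the trajectory-level training signal. To connect object trajectories explicitly, it proposes Motion Chain of Thought (MCoT), a structured reasoning pathway that uses a `<motion/>` tag to summarize each object's direction, speed, and change in speed. To train Motion-o, the authors design a reward function that pushes the model to reason directly over visual evidence, with no changes to the model architecture. Experiments show that Motion-o improves spatio-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, which supports trajectory reasoning as an important extension for video understanding. Code: https://github.com/ostadabbas/Motion-o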

Original Abstract

Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about *how* objects move between observations: no prior work has articulated motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce **Motion-o**, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories explicit through a discrete `<motion/>` tag, which summarizes per-object direction, speed, and scale (of velocity) change and so connects grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
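To make the idea of a per-object motion summary concrete, the following Python sketch derives a direction, a speed level, and a speed-change label from a bounding box track and renders them as a `<motion/>` tag. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the tag's attributes or how each quantity is computed, so the `Box` type, the attribute names (`obj`, `dir`, `speed`, `change`), the thresholds, and the reading of "scale (of velocity) change" as a change in speed magnitude are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box at time t (seconds), in normalized image coordinates (hypothetical type)."""
    t: float
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center(self):
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def direction_label(dx, dy, eps=1e-3):
    """Map a net displacement to a coarse direction word (image y grows downward)."""
    if abs(dx) <= eps and abs(dy) <= eps:
        return "stationary"
    horiz = "right" if dx > eps else "left" if dx < -eps else ""
    vert = "down" if dy > eps else "up" if dy < -eps else ""
    return "-".join(part for part in (vert, horiz) if part)


def summarize_motion(track, slow_max=0.1, fast_min=0.3, change_tol=0.1):
    """Summarize a time-ordered track as direction, speed level, and speed change.

    Thresholds are illustrative assumptions, in normalized units per second.
    """
    if len(track) < 2:
        raise ValueError("need at least two boxes")
    pairs = list(zip(track, track[1:]))
    if any(b.t <= a.t for a, b in pairs):
        raise ValueError("timestamps must be strictly increasing")

    # Net displacement between the first and last box centers gives the direction.
    (cx0, cy0), (cx1, cy1) = track[0].center, track[-1].center
    direction = direction_label(cx1 - cx0, cy1 - cy0)

    # Mean speed is the center path length divided by the elapsed time.
    path = sum(math.dist(a.center, b.center) for a, b in pairs)
    mean_speed = path / (track[-1].t - track[0].t)
    if mean_speed < slow_max:
        speed = "slow"
    elif mean_speed > fast_min:
        speed = "fast"
    else:
        speed = "moderate"

    # Speed change compares the early segments of the track with the late ones.
    seg_speeds = [math.dist(a.center, b.center) / (b.t - a.t) for a, b in pairs]
    half = len(seg_speeds) // 2
    change = "constant"
    if half > 0:
        early = sum(seg_speeds[:half]) / half
        late = sum(seg_speeds[-half:]) / half
        if late > early * (1 + change_tol):
            change = "accelerating"
        elif late < early * (1 - change_tol):
            change = "decelerating"
    return {"direction": direction, "speed": speed, "change": change}


def format_motion_tag(obj_id, summary):
    """Render the summary as a self-closing tag; attribute names are hypothetical."""
    return (
        f'<motion obj="{obj_id}" dir="{summary["direction"]}" '
        f'speed="{summary["speed"]}" change="{summary["change"]}"/>'
    )
```

Calling `format_motion_tag("car", summarize_motion(track))` on a list of `Box` values yields one tag per object. Because every field is discrete, a tag written by a model could in principle be checked against the underlying boxes, which is the sense in which the abstract calls trajectories verifiable.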

