2602.13313v1 Feb 10, 2026 cs.CV

Agentic Spatio-Temporal Grounding via Collaborative Reasoning

Y. Ong (Citations: 288, h-index: 9)

Heng Zhao (Citations: 11, h-index: 2)

J. Zhou (Citations: 70, h-index: 3)

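Spatio-temporal video grounding (STVG) aims to locate the spatio-temporal region of a specific object or person in a video according to a given text query. Most existing methods localize the target frame by frame within a predicted temporal span, which leads to redundant computation, heavy supervision requirements, and limited generalization. Weakly supervised methods reduce annotation cost but remain tied to dataset-level training and optimization, and their performance suffers. To address these problems, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for STVG in open-world, training-free scenarios. Specifically, two specialized agents built on modern multimodal large language models (MLLMs), the Spatial Reasoning Agent (SRA) and the Temporal Reasoning Agent (TRA), cooperate to find the target region in an autonomous, self-guided manner. ASTG follows a propose-and-evaluate scheme, decoupling spatial and temporal reasoning and automating region extraction, verification, and temporal localization. A dedicated visual memory and dialogue context substantially improve retrieval efficiency. Experiments on popular benchmarks show that the proposed method outperforms existing weakly supervised and zero-shot methods and achieves performance on par with some fully supervised methods.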

Original Abstract

Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants reduce annotation costs but remain constrained by the dataset-level train-and-fit paradigm and deliver inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for STVG in an open-world, training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), built on modern Multimodal Large Language Models (MLLMs), work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluate paradigm, ASTG decouples spatial and temporal reasoning and automates the tube extraction, verification, and temporal localization processes. With a dedicated visual memory and dialogue context, retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach: it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some fully-supervised methods.
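The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of what a generic propose-and-evaluate loop with a score cache might look like. It is not the authors' ASTG code. Every name here (`Tube`, `VisualMemory`, `ground`, and the `demo_*` stand-ins for the spatial and temporal agents) is an assumption made for illustration, and the MLLM calls are replaced by fixed stub functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Tube:
    """Candidate spatio-temporal tube: one bounding box per frame index."""
    tube_id: str
    boxes: Dict[int, Box]


@dataclass
class VisualMemory:
    """Hypothetical cache so that a candidate is never scored twice.

    For simplicity the cache is keyed by tube id only, not by query.
    """
    scores: Dict[str, float] = field(default_factory=dict)
    calls: int = 0  # how many times the (expensive) evaluator actually ran


def ground(
    query: str,
    propose: Callable[[str], List[Tube]],
    evaluate: Callable[[Tube, str], float],
    localize: Callable[[Tube, str], Tuple[int, int]],
    memory: VisualMemory,
    threshold: float = 0.5,
) -> Optional[Tube]:
    """Propose candidate tubes, verify each one, then trim the best in time."""
    best: Optional[Tube] = None
    best_score = float("-inf")
    for tube in propose(query):
        # Only call the evaluator for candidates not already in memory.
        if tube.tube_id not in memory.scores:
            memory.scores[tube.tube_id] = evaluate(tube, query)
            memory.calls += 1
        score = memory.scores[tube.tube_id]
        if score >= threshold and score > best_score:
            best, best_score = tube, score
    if best is None:
        return None  # no candidate passed verification
    # Temporal step: keep only frames inside the predicted [start, end] span.
    start, end = localize(best, query)
    trimmed = {t: b for t, b in best.boxes.items() if start <= t <= end}
    return Tube(best.tube_id, trimmed)


# Stub agents with fixed outputs (stand-ins for MLLM-backed reasoning).
def demo_propose(query: str) -> List[Tube]:
    box = (10.0, 20.0, 50.0, 80.0)
    return [
        Tube("person_a", {t: box for t in range(0, 10)}),
        Tube("person_b", {t: box for t in range(5, 15)}),
    ]


def demo_evaluate(tube: Tube, query: str) -> float:
    return 0.9 if tube.tube_id == "person_b" else 0.2


def demo_localize(tube: Tube, query: str) -> Tuple[int, int]:
    return 8, 12


memory = VisualMemory()
result = ground("the person who picks up the cup",
                demo_propose, demo_evaluate, demo_localize, memory)
```

In this toy run the evaluator is called once per candidate, and a repeated call to `ground` with the same `memory` reuses the cached scores. That is one simple way a memory could cut redundant evaluation; how ASTG's visual memory and dialogue context actually work is not specified in the abstract.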
