2603.13433v1 Mar 13, 2026 cs.RO

Long-Horizon Task Planning Using Spatial Information in Natural Environments

Spatially Grounded Long-Horizon Task Planning in the Wild

Reuben Tan

Citations: 1,013

h-index: 14

Yong Jae Lee

Citations: 4

h-index: 1

Sehun Jung

Citations: 31

h-index: 3

Hyunjee Song

Citations: 1

h-index: 1

Donghyun Kim

Citations: 2

h-index: 1

Jianfeng Gao

Citations: 80

h-index: 4

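Recent work in robot manipulation increasingly uses Vision-Language Models (VLMs) for high-level reasoning, decomposing task instructions into sequential action plans expressed in natural language that support low-level motor control. However, current benchmarks do not assess whether these plans are actually spatially executable, in particular whether they specify exactly where the robot must interact to carry out the plan, which limits the evaluation of real-world manipulation capability. To close this gap, we define a new task, spatially grounded planning, and introduce GroundedPlanBench, a new benchmark for long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We also introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that uses real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our experiments show that spatially grounded long-horizon planning remains a major challenge for current VLMs. V2GP is a promising approach for improving both action planning and spatial grounding performance; its effectiveness is validated on our benchmark and in real-world robot manipulation experiments, advancing progress toward spatially actionable planning.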

Original Abstract

Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the plan, limiting evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We further introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that leverages real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our evaluations reveal that spatially grounded long-horizon planning remains a major bottleneck for current VLMs. Our results demonstrate that V2GP provides a promising approach for improving both action planning and spatial grounding performance, validated on our benchmark as well as through real-world robot manipulation experiments, advancing progress toward spatially actionable planning.
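
To make the idea of a spatially grounded plan concrete, the following is a minimal sketch, not the benchmark's actual format or metric. It assumes (hypothetically) that each sub-action is paired with a 2D image point and that a grounding counts as correct when the point falls inside an annotated target box. The names `Box`, `SubAction`, and `grounded_plan_correct` are illustrative only; the abstract does not specify how GroundedPlanBench represents or scores locations.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical annotation format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, point: Tuple[float, float]) -> bool:
        x, y = point
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass(frozen=True)
class SubAction:
    """One step of a plan: what to do and where to act (assumed 2D point)."""
    verb: str
    point: Tuple[float, float]


def grounded_plan_correct(plan: List[SubAction],
                          targets: List[Tuple[str, Box]]) -> bool:
    """Return True only if every step matches the reference verb, in order,
    and its point lies inside the reference box (an assumed criterion)."""
    if len(plan) != len(targets):
        return False
    return all(
        step.verb == verb and box.contains(step.point)
        for step, (verb, box) in zip(plan, targets)
    )


# Illustrative data, not taken from the benchmark.
reference = [("grasp", Box(100, 80, 160, 140)),
             ("place", Box(300, 200, 380, 260))]
good_plan = [SubAction("grasp", (130, 110)), SubAction("place", (340, 230))]
bad_plan = [SubAction("grasp", (130, 110)), SubAction("place", (50, 50))]

print(grounded_plan_correct(good_plan, reference))
print(grounded_plan_correct(bad_plan, reference))
```

Under this toy criterion, a plan with the right verbs but a misplaced point is rejected, which is the kind of failure that a text-only evaluation of the plan would miss.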

0 Citations

0 Influential

7 Altmetric

35.0 Score
