2603.02688v1 Mar 03, 2026 cs.AI

Retrieval-Augmented Robots via Retrieve-Reason-Act

Diji Yang (Citations: 205, h-index: 7)

Izat Temiraliev (Citations: 0, h-index: 0)

Yi Zhang (Citations: 102, h-index: 5)

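To achieve general-purpose utility, robots must evolve from passive executors into active users of information retrieval. In strictly zero-shot settings, where no prior demonstrations exist, a robot faces critical information gaps, such as the exact sequence of steps needed to assemble a complex piece of furniture, that cannot be closed by its internal parametric knowledge (common sense) or its past memory alone. Recent robotics work has tried to use search before action, but it focuses mainly on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches do not address the core information need of active task construction: acquiring previously unseen procedural knowledge from external, unstructured documentation. This paper defines the paradigm as Retrieval-Augmented Robotics (RAR), which gives the robot an information-seeking capability that bridges the gap between visual documentation and physical actuation. Task execution is formulated as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds abstract 2D diagrams to 3D physical parts via cross-modal alignment, and generates executable plans. The paradigm is validated on a challenging long-horizon assembly benchmark. In the experiments, robot planning grounded in retrieved visual documents significantly outperforms baselines that rely on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR by extending the scope of information retrieval from answering user queries to driving physical actions.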

Original Abstract

To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define the paradigm as Retrieval-Augmented Robotics (RAR), empowering the robot with the information-seeking capability that bridges the gap between visual documentation and physical actuation. We formulate the task execution as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.
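
The Retrieve-Reason-Act loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Manual`, `ManualStep`, `retrieve`, `reason`, `act`, `retrieve_reason_act`) is hypothetical, the manuals are plain text rather than visual documents, word overlap stands in for retrieval, and a dictionary lookup stands in for cross-modal 2D-to-3D grounding. For brevity it retrieves once per task and then iterates over the manual's steps.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ManualStep:
    """One step of a hypothetical procedural manual: join `part` to `target`."""
    part: str
    target: str


@dataclass
class Manual:
    """A toy text manual; the paper's setting uses visual manuals instead."""
    title: str
    steps: list[ManualStep]


def retrieve(task: str, corpus: list[Manual]) -> Manual:
    """Retrieve: pick the manual whose title shares the most words with the task.
    Toy lexical scoring, assumed here only for illustration."""
    task_words = set(task.lower().split())
    return max(corpus, key=lambda m: len(task_words & set(m.title.lower().split())))


def reason(step: ManualStep, scene: dict[str, str]) -> tuple[str, str] | None:
    """Reason: map the manual's part labels to physical object IDs in the scene.
    A dictionary lookup stands in for cross-modal alignment."""
    if step.part in scene and step.target in scene:
        return scene[step.part], scene[step.target]
    return None  # grounding failed: a referenced part is not in the scene


def act(grounded: tuple[str, str], log: list[str]) -> None:
    """Act: record the command instead of driving real hardware."""
    log.append(f"attach {grounded[0]} -> {grounded[1]}")


def retrieve_reason_act(task: str, corpus: list[Manual], scene: dict[str, str]) -> list[str]:
    """Run the loop for one task and return the executed commands."""
    log: list[str] = []
    manual = retrieve(task, corpus)
    for step in manual.steps:
        grounded = reason(step, scene)
        if grounded is None:
            break  # stop rather than act on an ungrounded step
        act(grounded, log)
    return log


# Illustrative data only; not taken from the paper's benchmark.
corpus = [
    Manual("bookshelf assembly", [ManualStep("side panel", "base"), ManualStep("shelf", "side panel")]),
    Manual("chair assembly", [ManualStep("leg", "seat"), ManualStep("backrest", "seat")]),
]
scene = {"leg": "obj_3", "seat": "obj_1", "backrest": "obj_7"}
plan = retrieve_reason_act("assemble the chair", corpus, scene)
print(plan)
```

Keeping the three stages as separate functions mirrors the structure named in the abstract, so each stand-in could in principle be swapped for a real retriever, grounding model, or controller without changing the loop.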
