2602.08167v1 Feb 09, 2026 cs.RO

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

Marco Pavone

Citations: 178

h-index: 7

Milan Ganai

Citations: 11

h-index: 1

Katie Luo

Citations: 1,502

h-index: 9

Jonas Frey

Citations: 381

h-index: 10

Clark W. Barrett

Citations: 314

h-index: 7

Abstract

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
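
The abstract does not spell out the objective, so the following is only a minimal sketch of the general importance-weighting idea, not the authors' implementation. It assumes K candidate reasoning traces are sampled from the model's own prior over reasoning, so that each trace's importance weight is proportional to the likelihood it assigns to the demonstrated action. The function names and the example log-likelihood values are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def importance_weights(log_action_likelihoods):
    """Self-normalized weights for K reasoning traces sampled from the prior.

    Assumption: trace k was drawn from p(z | o), so its unnormalized weight
    is p(a | o, z_k), the likelihood of the demonstrated action given it.
    """
    total = log_sum_exp(log_action_likelihoods)
    return [math.exp(v - total) for v in log_action_likelihoods]


def iw_log_marginal(log_action_likelihoods):
    """Importance-weighted estimate of log p(a | o) from K samples."""
    k = len(log_action_likelihoods)
    return log_sum_exp(log_action_likelihoods) - math.log(k)


def select_reasoning(candidates, log_action_likelihoods):
    """Return the trace with the largest weight, plus all weights."""
    weights = importance_weights(log_action_likelihoods)
    best = max(range(len(candidates)), key=weights.__getitem__)
    return candidates[best], weights


if __name__ == "__main__":
    # Hypothetical traces and hypothetical log p(action | observation, trace).
    traces = ["list every object", "plan the grasp", "describe the lighting"]
    scores = [-3.0, -1.0, -2.0]
    best, weights = select_reasoning(traces, scores)
    print(best, [round(w, 3) for w in weights])
```

In this sketch, traces that make the demonstrated action more likely receive larger weights, which is one way a refined reasoning dataset could be assembled without external rewards, verifiers, or human annotation.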

0 Citations

0 Influential

5 Altmetric

25.0 Score
