2602.20687v1 Feb 24, 2026 cs.AI

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Jun Song (Citations: 421, h-index: 8)

Bo Peng (Citations: 175, h-index: 4)

Pi Bu (Citations: 139, h-index: 6)

Keyu Pan (Citations: 34, h-index: 3)

Xinrun Xu (Citations: 87, h-index: 6)

Miao Chen (Citations: 101, h-index: 3)

Yang Du (Citations: 25, h-index: 3)

Lin Li (Citations: 402, h-index: 3)

Tong Xu (Citations: 14, h-index: 2)

Yinxiu Zhao (Citations: 2, h-index: 1)

Abstract

Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.

