2601.21570v1 Jan 29, 2026 cs.AI

EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots

Zixing Lei (citations: 674, h-index: 6)
Genjia Liu (citations: 32, h-index: 2)
Yuanshuo Zhang (citations: 13, h-index: 2)
Wenzhao Lian (citations: 767, h-index: 14)
Shanghang Zhang (citations: 72, h-index: 4)
Chuan Wen (citations: 194, h-index: 4)
Qipeng Liu (citations: 32, h-index: 4)
Siheng Chen (citations: 11, h-index: 2)

Abstract

The field of Embodied AI is evolving rapidly toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning, across heterogeneous backends. Inspired by the success of LLMs in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark that evaluates the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated reinforcement learning (RL) and imitation learning (IL) tasks, our framework posits executable code as the universal interface. We move beyond static code generation to assess a dynamic closed-loop workflow in which agents leverage environment feedback to iteratively draft, debug, and optimize solutions, with improvements ranging from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) an agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in the embodied AI field.
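The closed-loop workflow described in the abstract (draft, evaluate against environment feedback, revise) can be illustrated with a minimal Python sketch. This is not the benchmark's implementation: `Candidate`, `evaluate`, `propose`, and `closed_loop` are hypothetical names, the "simulator" is a toy scoring function, and the "agent" is a random perturbation standing in for an LLM revising its previous solution.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """A hypothetical policy configuration an agent might propose."""
    reward_scale: float
    learning_rate: float


def evaluate(candidate: Candidate) -> float:
    """Stand-in for a simulator rollout; returns a score in (0, 1].

    Toy objective (an assumption, not from the paper): the score peaks at
    reward_scale=1.0 and learning_rate=3e-4 and decays with distance.
    """
    penalty = (abs(candidate.reward_scale - 1.0)
               + abs(candidate.learning_rate - 3e-4) * 1000)
    return 1.0 / (1.0 + penalty)


def propose(best: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for an agent revising its best solution so far."""
    return Candidate(
        reward_scale=best.reward_scale + rng.uniform(-0.2, 0.2),
        learning_rate=max(1e-5, best.learning_rate + rng.uniform(-1e-4, 1e-4)),
    )


def closed_loop(initial: Candidate, iterations: int,
                seed: int = 0) -> Tuple[Candidate, List[float]]:
    """Draft, evaluate, and keep a revision only when feedback improves."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best, rng)
        score = evaluate(candidate)  # environment feedback
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


if __name__ == "__main__":
    start = Candidate(reward_scale=0.2, learning_rate=1e-3)
    best, history = closed_loop(start, iterations=50)
    print(f"first={history[0]:.3f} last={history[-1]:.3f}")
```

Because a revision is kept only when its score improves, the recorded best score never decreases across iterations. A real agent would replace `propose` with model-generated code edits and `evaluate` with training and rollouts in a simulator.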
