2602.21198v1 Feb 24, 2026 cs.LG

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Yining Hong (Citations: 69; h-index: 4)

Huang Huang (Citations: 45; h-index: 4)

Manling Li (Citations: 651; h-index: 9)

Fei-Fei Li (Citations: 1,380; h-index: 15)

Jiajun Wu (Citations: 578; h-index: 11)

Yejin Choi (Citations: 137; h-index: 4)

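Embodied LLMs give robots high-level task reasoning, but because they cannot reflect on what went wrong or why, deployment becomes a series of independent trials in which mistakes tend to repeat rather than accumulate into experience. Inspired by human reflective practitioners, this work introduces Reflective Test-Time Planning, which integrates two forms of reflection. The first is reflection-in-action, which generates and scores multiple candidate actions through internal reflection before execution. The second is reflection-on-action, which updates the internal reflection model and the action policy based on external reflection after execution. In addition, retrospective reflection lets the agent re-evaluate earlier decisions and update the model with hindsight for long-horizon credit assignment. Experiments on the newly designed Long-Horizon Household benchmark and the MuJoCo Cupboard Fitting benchmark show substantial gains over baseline models, and ablation studies confirm the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, show behavior being corrected through reflection.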

Original Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
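
The two reflection modes described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not the paper's implementation: the function name `reflective_step`, the `execute` callback, the learning rate `lr`, and the score table are all hypothetical, and a plain dictionary of scores with a moving-average update stands in for the learned reflection model and for test-time training.

```python
from typing import Callable, Dict, List


def reflective_step(
    candidates: List[str],
    scores: Dict[str, float],
    execute: Callable[[str], float],
    lr: float = 0.5,
) -> str:
    """Run one plan-act-reflect step over a fixed candidate set (toy stand-in)."""
    # Reflection-in-action (assumed form): score every candidate before
    # acting and pick the highest-scoring one. Unknown actions score 0.0.
    best = max(candidates, key=lambda action: scores.get(action, 0.0))

    # Execute the chosen action and observe external feedback.
    # The feedback is a hypothetical scalar in [0, 1].
    outcome = execute(best)

    # Reflection-on-action (assumed form): move the internal score of the
    # executed action toward the observed outcome.
    previous = scores.get(best, 0.0)
    scores[best] = previous + lr * (outcome - previous)
    return best
```

Calling `reflective_step` repeatedly with the same `scores` dictionary lets a failed action lose its score, so a different candidate can win on the next step instead of the same mistake repeating. Retrospective reflection, which revisits earlier decisions with hindsight, is not modeled here.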

5 Citations

1 Influential

7.5 Altmetric

44.5 Score
