2601.17885v1 Jan 25, 2026 cs.CV

PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation

Qingyu Fan

Zhaoxiang Li

Yi Lu

Wang Chen

Qiu Shen

Xiao-xiao Long

Yinghao Cai

Tao Lu

Shuo Wang

Xun Cao

Abstract

Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions and under viewpoint and scene variations. Existing vision-language-action (VLA) models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors. On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves on the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent improvements from depth distillation. Project website: https://peafowlvla.github.io/.
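
The abstract does not spell out how the per-token depth distributions are turned into 3D points, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes a softmax over a fixed set of depth bins, takes the expected depth (which is differentiable with respect to the logits), and unprojects the token's pixel center through a pinhole camera model. The function names, the bin centers, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of depth-bin logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, bin_centers):
    """Soft (differentiable) depth: probability-weighted mean of the bin centers."""
    probs = softmax(logits)
    return sum(p * d for p, d in zip(probs, bin_centers))


def lift_token(u, v, logits, bin_centers, fx, fy, cx, cy):
    """Unproject a token at pixel (u, v) into camera-frame 3D coordinates.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); these intrinsics are illustrative, not from the paper.
    """
    z = expected_depth(logits, bin_centers)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


# Hypothetical example: four depth bins (in meters) with uniform logits.
bin_centers = [0.5, 1.0, 1.5, 2.0]
point = lift_token(
    u=400.0, v=300.0,
    logits=[0.0, 0.0, 0.0, 0.0],
    bin_centers=bin_centers,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
```

With uniform logits the expected depth is the mean of the bin centers, so the token lands at the average bin depth along its camera ray. A sharper distribution would pull the point toward the dominant bin. Lifting tokens from several views into a shared frame in this way is what would make a local cross-view neighbor search possible, though the paper may implement the lifting and aggregation differently.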
