2604.14125v1 Apr 15, 2026 cs.CV

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Jiangmiao Pang (Citations: 588, h-index: 10)

Zhixuan Liang (Citations: 797, h-index: 10)

Ping Luo (Citations: 638, h-index: 8)

Yao Mu (Citations: 358, h-index: 2)

Zanxin Chen (Citations: 531, h-index: 6)

Chunpu Xu (Citations: 207, h-index: 5)

Haotian Liang (Citations: 30, h-index: 4)

Tianshuo Yang (Citations: 323, h-index: 9)

Guanyu Chen (Citations: 55, h-index: 4)

Yutian Chen (Citations: 12, h-index: 2)

Yitian Liu (Citations: 17, h-index: 2)

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce in the low-level part a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
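The sequential fusion described in the abstract can be pictured with a small, dependency-free sketch. It is only an illustration under stated assumptions, not the paper's implementation: the function names (`cross_attend`, `cascaded_cross_attention`), the single attention head, the absence of learned projections, and the plain residual connection are all hypothetical choices made here for brevity. The only thing taken from the abstract is the order in which the action tokens attend to the three context sources: global context, then object-centric crop features, then skill semantics.

```python
import math
from typing import List

Vector = List[float]
Tokens = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Tokens, context: Tokens) -> Tokens:
    """Single-head scaled dot-product cross-attention with a residual add.

    Assumption: keys and values are the raw context tokens (no learned
    projections), which keeps the sketch short and deterministic.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this query to every context token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        attended = [
            sum(w * k[d] for w, k in zip(weights, context))
            for d in range(dim)
        ]
        # Residual connection: keep the query and add what was attended.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def cascaded_cross_attention(
    action_tokens: Tokens,
    global_ctx: Tokens,
    crop_ctx: Tokens,
    skill_ctx: Tokens,
) -> Tokens:
    """Fuse three context sources one after another (hypothetical layout)."""
    x = action_tokens
    for context in (global_ctx, crop_ctx, skill_ctx):
        x = cross_attend(x, context)
    return x
```

In this toy form, each stage refines the action tokens using the output of the previous stage, so the crop features are queried by tokens that already carry global scene information. A real DiT block would add learned query, key, and value projections, multiple heads, normalization, and timestep conditioning for flow matching, none of which are specified in the abstract.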

Citations: 2 (0 influential); Altmetric: 5; Score: 27.0
