2604.01618v1 Apr 02, 2026 cs.CV

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Mingjie Wei (Citations: 63, h-index: 3)
Zhaoxia Yin (Citations: 100, h-index: 5)
Siming Huang (Citations: 14, h-index: 2)
Shuaihang Chen (Citations: 69, h-index: 2)
Yu Tian (Citations: 11, h-index: 1)
Chao-Liang Yu (Citations: 9, h-index: 2)
Jiawei Chen (Citations: 243, h-index: 10)
Jiawei Du (Citations: 3, h-index: 1)

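Vision-language-action (VLA) models have shown excellent performance in robotic manipulation, but their robustness to physically realizable adversarial attacks has not yet been studied sufficiently. Prior work has found vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces either have low applicability to real deployments or offer limited physical realism. Adversarial 3D textures, by contrast, attach naturally to the manipulated object and are easy to deploy in physical environments, so they can pose a more realistic and more damaging threat. Applying adversarial 3D textures to VLA systems is nonetheless not easy. The main obstacle is that standard 3D simulators provide no differentiable optimization path from the VLA objective function to object appearance, which makes end-to-end optimization difficult. To solve this, we propose Foreground-Background Decoupling (FBD), a technique that enables differentiable texture optimization through dual-renderer alignment while keeping the original simulation environment intact. In addition, so that the attack stays effective over long sequences and diverse viewpoints in the real world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Building on these designs, we present Tex3D, the first framework to optimize 3D adversarial textures end to end inside a VLA simulation environment. Experiments in simulation and on real robots show that Tex3D substantially degrades VLA performance across a variety of manipulation tasks, reaching task failure rates of up to 96.7%. Our results show that VLA systems are vulnerable to physically realizable 3D adversarial attacks and underline the need for robustness-aware training.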

Original Abstract

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long horizons and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
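
The abstract does not spell out how FBD or TAAO are implemented, so the following is only an illustrative sketch of two generic ideas it mentions: pasting a separately rendered object (foreground) onto a simulator frame (background) with a mask, and weighting per-frame losses so that some frames count more than others. The function names `composite` and `weighted_trajectory_loss`, the mask convention, and the weighting scheme are all assumptions made for illustration. They are not the authors' code, and NumPy here is not differentiable; a real pipeline would need an autodiff framework and a differentiable renderer.

```python
import numpy as np

def composite(foreground, background, mask):
    """Blend a rendered object onto a background frame.

    Hypothetical helper: foreground and background are (H, W, 3) arrays,
    mask is an (H, W) array in [0, 1] marking object pixels.
    """
    mask = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return mask * foreground + (1.0 - mask) * background

def weighted_trajectory_loss(per_frame_losses, frame_weights):
    """Weighted mean of per-frame losses (weights normalized to sum to 1).

    Hypothetical stand-in for "prioritizing critical frames": larger
    weights make the corresponding frames dominate the objective.
    """
    w = np.asarray(frame_weights, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(per_frame_losses, dtype=float)))

# Usage: a 2x2 frame where the object covers the diagonal pixels.
fg = np.ones((2, 2, 3))
bg = np.zeros((2, 2, 3))
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
frame = composite(fg, bg, mask)
loss = weighted_trajectory_loss([1.0, 3.0], [1.0, 3.0])
```

In this sketch only the foreground term depends on the object's appearance, which is the reason a masked composite lets a texture be optimized while the background frame is left as the simulator produced it.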
