arXiv:2601.11269v1 [cs.CV], Jan 16, 2026

X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

Maanping Shao

Feihong Zhang

Gu Zhang

Baiye Cheng

Zhengrong Xue

Huazhe Xu

Abstract

Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
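
The offline distillation stage described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the loss or training details, so the mean-squared feature-matching objective, the plain SGD update, and every name below (`distill_step`, `run_distillation`, `W_teacher`, `W_student`) are assumptions made for illustration. Small linear maps stand in for the frozen DINOv2 teacher and the trainable ResNet-18 student, and random vectors stand in for ImageNet images, so the cross-architecture aspect is not modeled. The only point shown is that the teacher's weights are never updated while the student is trained to reproduce the teacher's features.

```python
import random


def matvec(W, x):
    """Apply a linear 'encoder' W (list of rows) to an input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean-squared error between two feature vectors (assumed loss)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def distill_step(W_student, W_teacher, x, lr):
    """One SGD step on the student; the teacher is only read, never written.

    Returns the feature-matching loss measured before the update.
    """
    target = matvec(W_teacher, x)  # frozen teacher features
    pred = matvec(W_student, x)    # current student features
    n = len(pred)
    # d(mse)/dW[i][j] = (2 / n) * (pred[i] - target[i]) * x[j]
    for i in range(n):
        g = 2.0 / n * (pred[i] - target[i])
        for j in range(len(x)):
            W_student[i][j] -= lr * g * x[j]
    return mse(pred, target)


def run_distillation(steps=500, dim_in=8, dim_feat=4, lr=0.05, seed=0):
    """Distill a random frozen linear teacher into a zero-initialized student."""
    rng = random.Random(seed)
    W_teacher = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_feat)]
    W_student = [[0.0] * dim_in for _ in range(dim_feat)]
    losses = []
    for _ in range(steps):
        x = [rng.gauss(0, 1) for _ in range(dim_in)]  # stand-in for an image
        losses.append(distill_step(W_student, W_teacher, x, lr))
    return losses, W_teacher, W_student


if __name__ == "__main__":
    losses, _, _ = run_distillation()
    print("first loss:", losses[0], "last loss:", losses[-1])
```

In the pipeline the abstract describes, the student obtained this way would then be attached to a diffusion policy head and fine-tuned jointly on the manipulation task; that second stage is not sketched here.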
