2605.26636v1 May 26, 2026 cs.CV

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

Zhuoyang Zhang

Yao Lu

Hanrong Ye

Song Han

Dongyun Zou

Junyu Chen

Wenkun He

Qin Peng

Hongxu Yin

Han Cai

Yu Wang

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
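To make the three block types named in the abstract concrete, here is a minimal NumPy sketch of full, window, and linear attention over a flat token sequence. It is an illustration under stated assumptions, not the JetViT implementation: the function names, the 1D non-overlapping windows, and the `elu(x) + 1` feature map for linear attention are all choices made for this example, and the abstract does not specify the authors' actual block designs.

```python
import numpy as np


def full_attention(q, k, v):
    """Softmax attention over all tokens: O(N^2) in sequence length N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def window_attention(q, k, v, window):
    """Softmax attention restricted to non-overlapping windows.

    Assumption for this sketch: tokens are grouped into contiguous 1D
    windows; real ViTs typically use 2D spatial windows.
    """
    n = q.shape[0]
    assert n % window == 0, "sequence length must be divisible by window"
    out = np.empty_like(v)
    for start in range(0, n, window):
        sl = slice(start, start + window)
        out[sl] = full_attention(q[sl], k[sl], v[sl])
    return out


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: O(N) in sequence length.

    Assumption for this sketch: the feature map is elu(x) + 1, a common
    choice in the linear-attention literature, not necessarily JetViT's.
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always > 0

    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                 # (d, d_v) summary, independent of N^2
    norm = qf @ kf.sum(axis=0)    # (N,) per-query normalizer
    return (qf @ kv) / (norm[:, None] + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 64, 16
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    for name, out in [
        ("full", full_attention(q, k, v)),
        ("window", window_attention(q, k, v, window=16)),
        ("linear", linear_attention(q, k, v)),
    ]:
        print(name, out.shape)
```

The cost difference comes from what each variant materializes: full attention builds an N x N weight matrix, window attention builds N / w matrices of size w x w, and linear attention only builds a d x d summary of keys and values. That is why replacing redundant full-attention blocks pays off most at high resolution, where N is large.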
