2602.14615v1 Feb 16, 2026 cs.CV

VariViT: A Vision Transformer for Variable Image Sizes

Aswathi Varma

Citations: 7

h-index: 2

Suprosanna Shit

Citations: 2,370

h-index: 23

Chinmay Prabhakar

Citations: 226

h-index: 9

Daniel Scholz

Citations: 55

h-index: 3

H. Li

Citations: 296

h-index: 9

Bjoern H Menze

University of Zurich

Citations: 29,793

h-index: 70

Daniel Rueckert

Citations: 610

h-index: 13

Benedikt Wiestler

Citations: 6,222

h-index: 30

Abstract

Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
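The abstract does not spell out how the batching strategy works. One plausible reading, offered here only as a sketch, is that crops of identical shape are grouped together so each batch can be stacked without padding while the patch size stays fixed. The patch edge length `PATCH = 16` and the helper names `patch_grid`, `num_tokens`, and `batches_by_shape` are illustrative assumptions and are not taken from the VariViT code base.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Shape = Tuple[int, int, int]

PATCH = 16  # hypothetical patch edge length in voxels (assumption, not from the paper)


def patch_grid(shape: Shape, patch: int = PATCH) -> Shape:
    """Return the number of patches along each axis of a 3D crop.

    The crop sides must be multiples of the patch size, so the patch size
    stays constant while the number of patches varies with the crop.
    """
    if any(side % patch for side in shape):
        raise ValueError(f"shape {shape} is not a multiple of patch size {patch}")
    return tuple(side // patch for side in shape)


def num_tokens(shape: Shape, patch: int = PATCH) -> int:
    """Total number of patch tokens a crop of this shape would produce."""
    depth, height, width = patch_grid(shape, patch)
    return depth * height * width


def batches_by_shape(shapes: Sequence[Shape], batch_size: int) -> Iterator[List[int]]:
    """Yield lists of sample indices in which every crop has the same shape.

    This is an assumed grouping scheme: samples are bucketed by shape, then
    each bucket is cut into chunks of at most `batch_size` indices.
    """
    buckets: Dict[Shape, List[int]] = defaultdict(list)
    for index, shape in enumerate(shapes):
        buckets[shape].append(index)
    for indices in buckets.values():
        for start in range(0, len(indices), batch_size):
            yield indices[start:start + batch_size]


# Example: five crops of two different sizes, batch size 2.
shapes = [(64, 64, 64), (96, 64, 64), (64, 64, 64), (96, 64, 64), (64, 64, 64)]
batches = list(batches_by_shape(shapes, batch_size=2))
tokens_small = num_tokens((64, 64, 64))  # 4 * 4 * 4 patches
tokens_large = num_tokens((96, 64, 64))  # 6 * 4 * 4 patches
```

In this sketch the smaller crop yields 64 tokens and the larger one 96, which illustrates the computation-accuracy tradeoff the abstract mentions: self-attention cost grows with the token count, so keeping crops tight around the region of interest reduces work per sample.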

2 Citations

0 Influential

63.862943611199 Altmetric

321.3 Score
