2602.16918v1 Feb 18, 2026 cs.CV

Xray-Visual Models: Scaling Vision models on Industry Scale Data

Arkabandhu Chowdhury

Citations: 14,618

h-index: 6

Jun Xiao

Citations: 99

h-index: 4

Tsung-Yu Lin

Citations: 100

h-index: 2

Linda Wang

Citations: 3,609

h-index: 7

Hongli Xu

Citations: 51

h-index: 4

Yiming Liu

Citations: 49

h-index: 3

M. Hsu

Citations: 0

h-index: 0

Chaitanya Ahuja

Citations: 51

h-index: 3

Hao Yuan

Citations: 42

h-index: 5

Jianpeng Cheng

Citations: 23

h-index: 2

Hong-you Chen

Citations: 26

h-index: 2

Hao Xu

Citations: 36

h-index: 4

Chao Li

Citations: 906

h-index: 4

Abhijeet Awasthi

Citations: 20

h-index: 2

Don Husa

Citations: 288

h-index: 1

Michael Ge

Citations: 91

h-index: 2

Sumedha Singla

Citations: 203

h-index: 6

Phong Dingh

Citations: 0

h-index: 0

Satya Narayan Shukla

Citations: 1,339

h-index: 14

Yonghuan Yang

Citations: 48

h-index: 4

David Jacobs

Citations: 152

h-index: 4

Qi Guo

Citations: 130

h-index: 4

Xiangjun Fan

Citations: 87

h-index: 2

Jihye Moon

Citations: 11

h-index: 1

Shlok Kumar Mishra

University of Maryland

Citations: 981

h-index: 14

Aashu Singh

Citations: 246

h-index: 5

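This paper introduces Xray-Visual, a unified vision model architecture designed for large-scale image and video understanding. The model uses more than 15 billion curated image-text pairs and 10 billion video-hashtag pairs collected from Facebook and Instagram, and its data curation pipeline applies balancing and noise-suppression strategies to maximize semantic diversity and minimize label noise. We introduce a three-stage training pipeline that combines self-supervised learning (MAE), semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize the image and video modalities. The architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT). Extensive experiments show that Xray-Visual achieves state-of-the-art performance on diverse benchmarks, including ImageNet (image classification), Kinetics and HMDB51 (video understanding), and MSCOCO (cross-modal retrieval). The model also shows strong robustness to domain shift and adversarial attacks. In addition, integrating a large language model (LLM) as the text encoder substantially improves retrieval performance and generalization, particularly in real-world settings. Xray-Visual sets a new baseline for scalable multimodal vision models while maintaining high accuracy and computational efficiency.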

Original Abstract

We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.
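For readers unfamiliar with the "CLIP-style contrastive learning" named in the abstract, the general idea can be sketched as a symmetric InfoNCE loss over a batch of matched image and text embeddings. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions made for this example, and the abstract does not state the loss details used in Xray-Visual.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_row(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of image_embs matches pair i of text_embs.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should assign highest probability to column i
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign highest probability to row j
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax_row(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: perfectly aligned pairs versus deliberately swapped pairs
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairs give a loss close to zero while the swapped pairs give a large loss, which is the behavior the contrastive objective relies on: matched image-text pairs are pulled together and mismatched pairs are pushed apart.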
