2602.01905v1 Feb 02, 2026 cs.CV

Learning Sparse Visual Representations via Spatial-Semantic Factorization

Jianwei Yang

Citations: 1,642

h-index: 7

Theodore Zhao

Citations: 14

h-index: 3

Sid Kiblawi

Citations: 3,052

h-index: 5

N. Usuyama

Citations: 7,243

h-index: 21

Reuben Tan

Citations: 1,013

h-index: 14

Noel C. F. Codella

Citations: 301

h-index: 7

Tristan Naumann

Citations: 1,379

h-index: 13

H. Poon

Citations: 351

h-index: 7

Mu-Hsin Wei

Citations: 1,944

h-index: 13

Original Abstract

Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.
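
The factorization described in the abstract can be illustrated with a small NumPy sketch: a dense grid of N patch features is written as the product of an N x K localization matrix and K semantic tokens of dimension D. This is not the STELLAR implementation. The function name, the shapes (196 patches, 64 channels), and the softmax normalization of the localization matrix are assumptions made for illustration; only the low-rank product form and the figure of 16 tokens come from the abstract.

```python
import numpy as np


def factorized_features(semantic_tokens, localization_logits):
    """Rebuild a dense feature grid from K sparse tokens (illustrative only).

    semantic_tokens:     (K, D) array, one row per semantic concept.
    localization_logits: (N, K) array, one row per image patch.
    """
    # Assumption: a softmax over the K tokens, so each patch feature is a
    # convex mixture of the semantic tokens. The abstract does not specify
    # how the localization matrix is normalized.
    z = localization_logits - localization_logits.max(axis=1, keepdims=True)
    localization = np.exp(z)
    localization /= localization.sum(axis=1, keepdims=True)

    # Low-rank product: (N, K) @ (K, D) -> (N, D), with rank at most K.
    dense = localization @ semantic_tokens
    return dense, localization


rng = np.random.default_rng(0)
K, D, N = 16, 64, 196  # 16 tokens as in the abstract; D and N are hypothetical
tokens = rng.normal(size=(K, D))
logits = rng.normal(size=(N, K))
dense, localization = factorized_features(tokens, logits)
```

In this form, where a concept appears is carried entirely by the localization matrix, so an alignment loss could in principle act on `tokens` alone while reconstruction uses the full product `dense`.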

0 Citations

0 Influential

10.5 Altmetric

52.5 Score
