2604.03919v1 Apr 05, 2026 cs.CV

Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders

Original Abstract

We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover and even exceed raw temporal coherence. The contrastive loss weight controls a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval by up to 2.8x in R@1. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under neutral (CLIP) similarity. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
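To make the trade-off described in the abstract concrete, the following NumPy sketch shows a TopK SAE forward pass, a simple frame-to-frame coherence measure, and a reconstruction loss combined with a weighted temporal contrastive term. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact contrastive objective or the autocorrelation metric, so the InfoNCE form, the cosine-based coherence measure, and all names (`topk_encode`, `lag1_coherence`, `temporal_info_nce`, `sae_loss`, `lam`) are hypothetical. Matryoshka grouping is omitted.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, k):
    """Hard TopK: keep the k largest ReLU pre-activations per frame, zero the rest."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)            # (T, D) non-negative activations
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]      # indices of the k largest per row
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return codes

def lag1_coherence(z, eps=1e-8):
    """Assumed coherence proxy: mean cosine similarity of consecutive frame codes."""
    a, b = z[:-1], z[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float((num / den).mean())

def temporal_info_nce(z, temperature=0.1, eps=1e-8):
    """Assumed contrastive term: each even frame's positive is the next (odd) frame;
    the other odd frames in the clip serve as negatives."""
    n = (z.shape[0] // 2) * 2                           # drop a trailing unpaired frame
    zn = z[:n] / (np.linalg.norm(z[:n], axis=1, keepdims=True) + eps)
    anchors, positives = zn[0::2], zn[1::2]
    logits = (anchors @ positives.T) / temperature      # (n/2, n/2), positives on the diagonal
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, k, lam):
    """Reconstruction MSE plus lam times the temporal contrastive term."""
    z = topk_encode(x, W_enc, b_enc, k)
    recon = z @ W_dec + b_dec
    mse = float(np.mean((x - recon) ** 2))
    return mse + lam * temporal_info_nce(z), z
```

Here `x` stands for per-frame backbone features of one clip with shape `(T, d)`. Setting `lam = 0` reduces the objective to a plain TopK SAE, and raising it trades reconstruction fidelity for frame-to-frame stability, which is the tunable balance the abstract refers to.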
