2604.13304v1 Apr 14, 2026 cs.CV

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

Konstantinos D. Polyzos (Citations: 201, h-index: 8)
Difei Gu (Citations: 35, h-index: 3)
Gerasimos Chatzoudis (Citations: 0, h-index: 0)
Gemma E. Moran (Citations: 220, h-index: 7)
Dimitris N. Metaxas (Citations: 223, h-index: 10)
Hao Wang (Citations: 734, h-index: 8)
Zhuowei Li (Citations: 90, h-index: 4)

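Understanding the internal activations of Vision Transformers (ViTs) is essential for building interpretable and trustworthy models. Sparse autoencoders (SAEs) have been used to extract human-understandable features, but they act on individual layers, so they capture neither the cross-layer computational structure of the Transformer nor the relative importance of each layer in forming the final-layer representation. As an alternative, this work introduces Cross-Layer Transcoders (CLTs) as faithful, sparse, depth-aware proxy models for the MLP blocks of ViTs. A CLT uses an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, turning the final representation from an opaque embedding into a transparent, layer-resolved construction that supports faithful attribution and process-level interpretability. CLTs are trained on CLIP ViT-B/32 and ViT-B/16 using CIFAR-100, COCO, and ImageNet-100. In the experiments, CLTs reconstruct post-MLP activations with high fidelity while preserving, and in some cases improving, CLIP zero-shot classification accuracy. On the interpretability side, the cross-layer contribution scores provide faithful attribution: the final representation is concentrated in a small set of dominant layer-wise terms, removing those terms degrades performance, and retaining them largely preserves it. These results support adopting CLTs as an interpretable proxy for ViTs in the vision domain.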

Original Abstract

Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture both the cross-layer computational structure of Transformers and the relative significance of each layer in forming the last-layer representation. As an alternative, we adopt Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity for post-MLP activations while preserving, and in some cases even improving, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a small set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results support adopting CLTs as an alternative interpretable proxy for ViTs in the vision domain.
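
To make the additive, layer-resolved construction in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the toy sizes, the ReLU encoder, the per-pair decoder dictionary `W_dec`, and the norm-share score in `contribution_scores` are all hypothetical choices, and the weights are random rather than trained. The only thing it is meant to show is that each reconstructed post-MLP activation is a plain sum of terms, one per source layer at or before the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, d_feat = 4, 16, 64  # toy sizes (hypothetical)

# Hypothetical parameters: one encoder per layer, and one decoder for every
# (source, target) pair with source <= target. Random here, learned in practice.
W_enc = [rng.normal(0.0, 0.1, (d_model, d_feat)) for _ in range(n_layers)]
b_enc = [np.zeros(d_feat) for _ in range(n_layers)]
W_dec = {(s, t): rng.normal(0.0, 0.1, (d_feat, d_model))
         for t in range(n_layers) for s in range(t + 1)}

def encode(x, layer):
    """Non-negative embedding via ReLU; the sparsity penalty used in training is omitted."""
    return np.maximum(x @ W_enc[layer] + b_enc[layer], 0.0)

def clt_reconstruct(mlp_inputs):
    """Return the reconstruction for every layer plus its additive (source, target) terms."""
    feats = [encode(x, layer) for layer, x in enumerate(mlp_inputs)]
    terms = {(s, t): feats[s] @ W_dec[(s, t)] for (s, t) in W_dec}
    recon = [sum(terms[(s, t)] for s in range(t + 1)) for t in range(n_layers)]
    return recon, terms

def contribution_scores(terms, target):
    """Assumed score: each source layer's share of the total L2 norm of the target's terms."""
    norms = np.array([np.linalg.norm(terms[(s, target)]) for s in range(target + 1)])
    return norms / norms.sum()

# Usage: random vectors stand in for the per-layer MLP inputs of one token.
mlp_inputs = [rng.normal(size=d_model) for _ in range(n_layers)]
recon, terms = clt_reconstruct(mlp_inputs)
scores = contribution_scores(terms, target=n_layers - 1)
print(scores)  # one non-negative weight per source layer, summing to 1
```

Because the last-layer reconstruction is a sum over source layers, dropping or keeping individual terms is a well-defined ablation, which is the kind of removal and retention test the abstract describes for the dominant layer-wise terms.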
