2604.18756v1 Apr 20, 2026 cs.LG

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed

Citations: 78

h-index: 2

Sabrina Sadiekh

Citations: 1

h-index: 1

Chirag Agarwal

Citations: 14

h-index: 1

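Large language models (LLMs) are vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. Sparse autoencoders (SAEs) are widely used for interpretability, but their effect on robustness remains understudied. This work integrates pretrained SAEs into the transformer residual stream at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen), two strong white-box attacks (GCG, BEAST), and three black-box benchmarks, SAE-augmented models reduce jailbreak success rate by up to 5x relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations show (i) a monotonic relationship between L0 sparsity and attack success rate, and (ii) a robustness-utility tradeoff that varies by layer. These results are consistent with the representational bottleneck hypothesis and suggest that sparse projection reshapes the optimization geometry exploited by jailbreak attacks.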

Original Abstract

Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
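
The abstract's central mechanism, passing a residual-stream activation through a pretrained SAE at inference time, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The top-k activation, the weight layout, and the names `sae_encode`, `sae_decode`, `splice_sae`, and `l0` are all assumptions made for the example, and random weights stand in for a pretrained SAE. A real setup would more likely apply the same encode/decode step inside a framework-level hook on one transformer layer, where the operations remain differentiable.

```python
import random

def sae_encode(h, W_enc, b_enc, b_dec, k):
    """Encode a residual vector h into a sparse code with at most k active latents.

    Assumption: a top-k ReLU SAE; the abstract only states that L0 sparsity is varied.
    """
    centered = [hi - bi for hi, bi in zip(h, b_dec)]
    pre = [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Keep only the k largest activations; everything else is zeroed.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre)]

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the residual vector from the sparse code z.

    W_dec has one row per latent; each row is a direction in residual space.
    """
    out = list(b_dec)
    for zi, row in zip(z, W_dec):
        if zi != 0.0:
            for j, wj in enumerate(row):
                out[j] += zi * wj
    return out

def splice_sae(h, params, k):
    """Replace h with its SAE reconstruction; the base model's weights are untouched."""
    z = sae_encode(h, params["W_enc"], params["b_enc"], params["b_dec"], k)
    return sae_decode(z, params["W_dec"], params["b_dec"]), z

def l0(z):
    """Number of non-zero latents (the L0 sparsity of the code)."""
    return sum(1 for v in z if v != 0.0)

# Hypothetical toy dimensions; random weights stand in for a pretrained SAE.
random.seed(0)
d_model, n_latents, k = 8, 32, 4
params = {
    "W_enc": [[random.gauss(0, 0.3) for _ in range(d_model)] for _ in range(n_latents)],
    "b_enc": [0.0] * n_latents,
    "W_dec": [[random.gauss(0, 0.3) for _ in range(d_model)] for _ in range(n_latents)],
    "b_dec": [0.0] * d_model,
}
h = [random.gauss(0, 1) for _ in range(d_model)]
h_hat, z = splice_sae(h, params, k)
print("active latents:", l0(z), "of", n_latents)
```

In this sketch, `k` is the upper bound on active latents, so lowering it tightens the bottleneck that downstream layers see. That is the knob the L0 ablation in the abstract refers to, under the top-k assumption stated above.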
