2605.07447v1 May 08, 2026 cs.CV

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Lawrence B. Hsieh (Citations: 3, h-index: 1)

Hao Wang (Citations: 77, h-index: 2)

Yiqun Sun (Citations: 0, h-index: 0)

Pengfei Wei (Citations: 466, h-index: 11)

Daisuke Kawahara (Citations: 80, h-index: 3)

Original Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
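The pipeline the abstract describes (an SAE trained with a plain reconstruction objective, whose sparse latents then feed an adversarial-vs-clean classifier) can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the activations are synthetic Gaussian vectors standing in for hidden states captured at one VLM layer, the "attack" is modeled as a fixed shift along one coordinate, the SAE is a single ReLU layer with an L1 penalty, and the classifier is a logistic-regression probe. The names `make_activations`, `init_sae`, `sae_step`, and `fit_probe` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_LATENT = 16, 64  # hypothetical sizes, far smaller than a real VLM

def make_activations(n, attacked):
    """Synthetic stand-in for hidden states captured at one VLM layer."""
    x = rng.normal(size=(n, D_MODEL))
    if attacked:
        x[:, 0] += 3.0  # assumption: an attack shifts activations along one axis
    return x

def init_sae():
    """Random initial parameters for a one-hidden-layer sparse autoencoder."""
    return {
        "W_enc": rng.normal(0.0, 0.1, (D_MODEL, D_LATENT)),
        "b_enc": np.zeros(D_LATENT),
        "W_dec": rng.normal(0.0, 0.1, (D_LATENT, D_MODEL)),
        "b_dec": np.zeros(D_MODEL),
    }

def encode(p, x):
    """Non-negative latents z = ReLU(x W_enc + b_enc)."""
    return np.maximum(0.0, x @ p["W_enc"] + p["b_enc"])

def sae_loss(p, x, l1=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty on z."""
    z = encode(p, x)
    x_hat = z @ p["W_dec"] + p["b_dec"]
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    return recon + l1 * np.mean(np.sum(z, axis=1))

def sae_step(p, x, lr=0.01, l1=1e-3):
    """One full-batch gradient-descent step on sae_loss (hand-derived gradients)."""
    n = x.shape[0]
    pre = x @ p["W_enc"] + p["b_enc"]
    z = np.maximum(0.0, pre)
    x_hat = z @ p["W_dec"] + p["b_dec"]
    g_xhat = 2.0 * (x_hat - x) / n                        # dL/dx_hat
    g_pre = (g_xhat @ p["W_dec"].T + l1 / n) * (pre > 0)  # back through ReLU
    grads = {
        "W_enc": x.T @ g_pre, "b_enc": g_pre.sum(axis=0),
        "W_dec": z.T @ g_xhat, "b_dec": g_xhat.sum(axis=0),
    }
    for name in p:
        p[name] -= lr * grads[name]

def fit_probe(z, y, lr=0.5, steps=500):
    """Logistic-regression probe on SAE latents (label 1 = attacked)."""
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-np.clip(z @ w + b, -30.0, 30.0)))
        err = prob - y
        w -= lr * (z.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

# Unlabeled activations train the SAE; labels are used only by the probe.
x_train = np.vstack([make_activations(500, False), make_activations(500, True)])
y_train = np.concatenate([np.zeros(500), np.ones(500)])
x_test = np.vstack([make_activations(200, False), make_activations(200, True)])
y_test = np.concatenate([np.zeros(200), np.ones(200)])

sae = init_sae()
loss_before = sae_loss(sae, x_train)
for _ in range(300):
    sae_step(sae, x_train)
loss_after = sae_loss(sae, x_train)

w, b = fit_probe(encode(sae, x_train), y_train)
pred = (encode(sae, x_test) @ w + b) > 0
accuracy = float(np.mean(pred == y_test))
print(f"SAE loss {loss_before:.3f} -> {loss_after:.3f}, probe accuracy {accuracy:.3f}")
```

In a real setting the synthetic vectors would be replaced by activations read from a chosen layer of the frozen VLM, which stays untouched, consistent with the plug-and-play framing. The abstract also reports that combining signals from multiple layers helps; it does not say how they are combined, so a natural but assumed extension of this sketch would fit one SAE and probe per layer and average their scores.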

Citations: 0, Influential: 0, Altmetric: 5.5, Score: 27.5
