2602.18252v1 Feb 20, 2026 cs.CV

On the Adversarial Robustness of Discrete Image Tokenizers

Nicolas Flammarion

Citations: 3,368

h-index: 18

Rishika Bhagwatkar

Citations: 279

h-index: 4

Irina Rish

Citations: 604

h-index: 7

Francesco Croce

Citations: 6,383

h-index: 22

Abstract

Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. As the first work to study this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers and thereby change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. Although unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.
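The feature-level attack described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not state the loss, the threat model, or the optimizer, so the squared L2 feature distance, the L-infinity budget `eps`, the PGD-style signed steps, and the linear patch encoder `W` are all assumptions made for illustration. The function names (`encode`, `nearest_token`, `tokenize`, `feature_attack`) are hypothetical.

```python
import random

def encode(W, patch):
    # Toy linear encoder (assumption): one feature vector per image patch.
    return [sum(w * p for w, p in zip(row, patch)) for row in W]

def nearest_token(codebook, z):
    # Vector quantization: index of the closest codebook entry.
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], z)))

def tokenize(W, codebook, patches):
    # A discrete tokenizer maps each patch to one token id.
    return [nearest_token(codebook, encode(W, p)) for p in patches]

def feature_attack(W, patches, eps=0.05, alpha=0.01, steps=20, seed=0):
    """L-inf PGD-style attack (assumed) that pushes features away from the clean ones.

    It needs no labels and no downstream model, only the encoder.
    """
    rng = random.Random(seed)
    adv = []
    for patch in patches:
        clean_z = encode(W, patch)
        # Random start: the gradient of the squared distance is zero at delta = 0.
        delta = [rng.uniform(-eps, eps) for _ in patch]
        for _ in range(steps):
            z = encode(W, [p + d for p, d in zip(patch, delta)])
            diff = [a - b for a, b in zip(z, clean_z)]
            # Gradient of ||W(x + delta) - W x||^2 w.r.t. delta is 2 * W^T diff.
            grad = [2 * sum(W[i][j] * diff[i] for i in range(len(W)))
                    for j in range(len(patch))]
            # Signed ascent step, then projection back onto the eps-ball.
            delta = [max(-eps, min(eps, d + alpha * ((g > 0) - (g < 0))))
                     for d, g in zip(delta, grad)]
        adv.append([p + d for p, d in zip(patch, delta)])
    return adv

# Toy data: 6 patches of 8 pixels, 4-dim features, 16-entry codebook (all made up).
rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
patches = [[rng.random() for _ in range(8)] for _ in range(6)]

adv = feature_attack(W, patches)
clean_tokens = tokenize(W, codebook, patches)
adv_tokens = tokenize(W, codebook, adv)
changed = sum(a != b for a, b in zip(clean_tokens, adv_tokens))
print(f"{changed} of {len(patches)} tokens changed")
```

Because the objective only involves encoder features, the same perturbation can be evaluated on any downstream task that consumes the tokens, which is one way to read the abstract's "application-agnostic" claim. A real tokenizer would replace the linear map with a deep encoder and compute the gradient by automatic differentiation.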

