2602.01025v1 Feb 01, 2026 cs.LG

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Yige Li (Citations: 375, h-index: 12)

Yutao Wu (Citations: 22, h-index: 1)

Hanxun Huang, The University of Melbourne (Citations: 1,153, h-index: 12)

Christopher Leckie (Citations: 8, h-index: 1)

Kaiyuan Cui (Citations: 11, h-index: 2)

Xingjun Ma (Citations: 75, h-index: 5)

S. Erfani (Citations: 5,672, h-index: 29)

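Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, so that text generation can be conditioned on both images and text. This multimodal integration also enlarges the attack surface, because it exposes the model to image-based jailbreak attacks designed to elicit harmful responses. Existing gradient-based jailbreak methods transfer poorly: their adversarial patterns overfit to a single white-box surrogate model and do not generalise to black-box models. This work proposes Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak finds universal adversarial patterns that generalise across diverse jailbreak objectives. Combining vision-level regularisation with semantically guided textual supervision mitigates surrogate overfitting and yields strong transferability across both models and attack targets. In extensive experiments, UltraBreak consistently outperforms prior jailbreak methods. Further analysis shows why earlier approaches fail to transfer and indicates that smoothing the loss landscape through semantic objectives is crucial for universal and transferable jailbreaks. The code and related materials are publicly available on GitHub: [https://github.com/kaiyuanCui/UltraBreak](https://github.com/kaiyuanCui/UltraBreak)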

Original Abstract

Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our [GitHub repository](https://github.com/kaiyuanCui/UltraBreak).
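The abstract does not spell out the exact form of the semantic objective, so the following is only a toy sketch of the general idea of relaxing an exact token target into an embedding-space target. It is not the UltraBreak loss. The three-word vocabulary, the 2-D embedding table `EMB`, and the function names are all hypothetical, chosen only to show that token-level cross-entropy penalises a near-synonym as heavily as an opposite word, while a cosine loss in embedding space does not.

```python
import numpy as np

# Hypothetical toy vocabulary: 0 = "big", 1 = "large", 2 = "small".
# The 2-D embeddings are made up; "large" sits close to "big", "small" points away.
EMB = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [-1.0, 0.0]])


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(logits, target_id):
    """Exact-token objective: negative log-probability of the target token."""
    return -np.log(softmax(logits)[target_id])


def semantic_loss(logits, target_id, emb):
    """Relaxed objective: 1 - cosine similarity between the expected output
    embedding (probability-weighted average) and the target embedding."""
    expected = softmax(logits) @ emb
    target = emb[target_id]
    cos = expected @ target / (np.linalg.norm(expected) * np.linalg.norm(target))
    return 1.0 - cos


# Two predictions that both miss the target token 0 ("big"):
logits_synonym = np.array([0.0, 5.0, 0.0])   # mass on "large"
logits_opposite = np.array([0.0, 0.0, 5.0])  # mass on "small"

for name, logits in [("synonym", logits_synonym), ("opposite", logits_opposite)]:
    print(name,
          "cross-entropy:", round(float(cross_entropy(logits, 0)), 3),
          "semantic:", round(float(semantic_loss(logits, 0, EMB)), 3))
```

In this toy setting the two cross-entropy values coincide, whereas the embedding-space loss is near zero for the synonym and near its maximum for the opposite word. A target region that accepts many paraphrases is one plausible reading of how a semantic objective could give a smoother loss landscape than exact token matching, in line with the abstract's closing claim.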
