2602.10148v1 Feb 09, 2026 cs.CR

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Yu Yan (Citations: 22, h-index: 2)

Shengjia Cheng (Citations: 1, h-index: 1)

Teli Liu (Citations: 30, h-index: 3)

Mingfeng Li (Citations: 54, h-index: 3)

Min Liu (Citations: 52, h-index: 5)

Sheng Sun (Citations: 660, h-index: 13)

Abstract

Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple, fixed image-text combinations whose attack complexity does not scale, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed VLMs' trained and generalized safety alignment patterns for jailbreak. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show our COMET achieves a state-of-the-art attack success rate.

1 Citation

0 Influential

6.5 Altmetric

33.5 Score
