2603.08486v1 Mar 09, 2026 cs.CV

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Di Wang

Citations: 123

h-index: 8

Qishun Yang

Citations: 0

h-index: 0

Shu Yang

Citations: 299

h-index: 11

Lijie Hu

Citations: 24

h-index: 1

Original Abstract

Multimodal large language models (MLLMs) suffer from safety misalignment, in which visual inputs enable harmful outputs. Existing methods address this with explicit safety labels or contrastive data. However, threat-related concepts are concrete and visually depictable, whereas safety concepts such as helpfulness are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral visual question answering (VQA) tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks show that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to the visual modality, offering a label-free approach to VLM alignment.
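
The abstract reports two rate-style metrics, attack success rate and over-refusal, without giving their exact definitions. The sketch below shows one common way such rates are computed, assuming each is the fraction of model responses flagged by a judge. The function names and the judge callables are hypothetical illustrations, not the paper's evaluation code; in practice the judge would be a benchmark-specific classifier or grader.

```python
from typing import Callable, Sequence


def flagged_rate(responses: Sequence[str], judge: Callable[[str], bool]) -> float:
    """Return the fraction of responses for which judge(response) is True."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(1 for r in responses if judge(r)) / len(responses)


def attack_success_rate(
    adversarial_responses: Sequence[str],
    is_harmful: Callable[[str], bool],
) -> float:
    # Assumption: an attack "succeeds" when the judge marks the response
    # to an adversarial prompt as harmful. Lower is better.
    return flagged_rate(adversarial_responses, is_harmful)


def over_refusal_rate(
    benign_responses: Sequence[str],
    is_refusal: Callable[[str], bool],
) -> float:
    # Assumption: over-refusal is measured on benign prompts as the share
    # of responses the judge marks as refusals. Lower is better.
    return flagged_rate(benign_responses, is_refusal)
```

Keeping the judge as a parameter separates the arithmetic from the harder question of how harmfulness or refusal is decided, which differs across safety benchmarks.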

0 Citations

0 Influential

5.5 Altmetric

27.5 Score
