2601.04262v1 Jan 07, 2026 cs.LG

Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

Wang Cai (Citations: 5, h-index: 2)
Yilin Wen (Citations: 15, h-index: 2)
Jinchang Hou (Citations: 34, h-index: 3)
Du Su (Citations: 5, h-index: 1)
Zhonghou Lv (Citations: 14, h-index: 3)
Chenfu Bao (Citations: 16, h-index: 3)
Yunfang Wu (Citations: 22, h-index: 2)
Guoqiu Wang (Citations: 163, h-index: 8)

Abstract

Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers: functional sensitivity and degree of conflict vary substantially across attention heads. Such global approaches impose uniform update rules across all parameters, and often produce suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, and this map then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities comes mainly from updating a small group of "high-conflict" heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.
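The abstract describes CAST only at a high level and gives no formulas, so the following is a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that Optimization Conflict can be approximated by the negative cosine similarity between per-head safety and utility gradients, that Functional Sensitivity can be approximated by the norm of the utility gradient, and that the two are combined by a simple product. The function names, the `skip_fraction` parameter, and the toy gradients are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two flat gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def conflict_map(safety_grads, utility_grads):
    """Score each head before alignment training.

    ASSUMPTION: score = max(0, -cos(g_safety, g_utility)) * ||g_utility||.
    The paper's actual combination rule is not stated in the abstract.
    """
    scores = {}
    for head, g_safety in safety_grads.items():
        g_utility = utility_grads[head]
        conflict = max(0.0, -cosine(g_safety, g_utility))
        sensitivity = math.sqrt(sum(x * x for x in g_utility))
        scores[head] = conflict * sensitivity
    return scores


def trainable_mask(scores, skip_fraction):
    """Mark the highest-scoring fraction of heads as frozen (False)."""
    n_skip = int(len(scores) * skip_fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    skipped = set(ranked[:n_skip])
    return {head: head not in skipped for head in scores}


def masked_sgd_step(params, safety_grads, mask, lr):
    """Apply one plain SGD step on the safety loss, skipping frozen heads."""
    updated = {}
    for head, weights in params.items():
        if mask[head]:
            updated[head] = [w - lr * g for w, g in zip(weights, safety_grads[head])]
        else:
            updated[head] = list(weights)
    return updated


# Toy example: four heads keyed by (layer, head), two parameters each.
safety_grads = {(0, 0): [1.0, 0.0], (0, 1): [1.0, 0.0],
                (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0]}
utility_grads = {(0, 0): [-2.0, 0.0], (0, 1): [1.0, 0.0],
                 (1, 0): [1.0, 0.0], (1, 1): [-0.1, -0.1]}
params = {head: [0.0, 0.0] for head in safety_grads}

scores = conflict_map(safety_grads, utility_grads)
mask = trainable_mask(scores, skip_fraction=0.25)
new_params = masked_sgd_step(params, safety_grads, mask, lr=0.1)
```

In this toy setup, head (0, 0) has opposed gradients and a large utility gradient, so it receives the highest score and is the one head frozen at `skip_fraction=0.25`; the other three heads are updated normally. In a real model the per-head gradients would come from slicing the attention projection gradients by head, which this sketch does not cover.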

0 Citations

0 Influential

4 Altmetric

20.0 Score
