2602.17696v1 Feb 06, 2026 cs.LG

Can LLM Safety Be Ensured by Constraining Parameter Regions?

Zongmin Li (Citations: 39, h-index: 3)

Farah Benamara (Citations: 2, h-index: 1)

Aixin Sun (Citations: 16, h-index: 2)

Jian Su (Citations: 20, h-index: 3)

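Large language models (LLMs) are often assumed to contain "safety regions": subsets of parameters whose modification directly affects safety-related behavior. This study systematically evaluates four safety region identification methods, covering parameter granularities from individual weights to entire Transformer layers, across four families of backbone LLMs of varying sizes. Using ten safety identification datasets, the study finds that the identified safety regions overlap only to a low or moderate degree, as measured by Intersection over Union (IoU). The overlap drops substantially further when the safety regions are refined with utility datasets (i.e., non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.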

Original Abstract

Large language models (LLMs) are often assumed to contain "safety regions": parameter subsets whose modification directly influences safety behaviors. We conduct a systematic evaluation of four safety region identification methods spanning different parameter granularities, from individual weights to entire Transformer layers, across four families of backbone LLMs with varying sizes. Using ten safety identification datasets, we find that the identified safety regions exhibit only low to moderate overlap, as measured by IoU. The overlap drops significantly when the safety regions are further refined using utility datasets (i.e., non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.
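The overlap metric named in the abstract, IoU, can be illustrated with a minimal sketch. It assumes each identified safety region is represented as a set of parameter identifiers; the identifiers, the two example regions, and the function name `iou` are hypothetical and are not taken from the paper.

```python
def iou(region_a, region_b):
    """Intersection over Union of two collections of parameter identifiers."""
    a, b = set(region_a), set(region_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty regions count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical regions found from two different safety datasets.
# Each element is an illustrative (layer name, unit index) pair.
region_from_dataset_1 = {("layer3", 0), ("layer3", 1), ("layer7", 4), ("layer9", 2)}
region_from_dataset_2 = {("layer3", 1), ("layer7", 4), ("layer8", 5), ("layer9", 2)}

# 3 shared elements out of 5 distinct elements in total.
overlap = iou(region_from_dataset_1, region_from_dataset_2)
print(f"IoU = {overlap:.2f}")
```

An IoU of 1 would mean the two datasets identify exactly the same parameters, and 0 would mean they share none. A stable, dataset-agnostic safety region would correspond to values near 1 across dataset pairs.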

0 Citations, 0 Influential, 1.5 Altmetric, 7.5 Score
