2602.02027v1 Feb 02, 2026 cs.AI

Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron

Mingyang Lv (Citations: 11, h-index: 2)
Jialin Wu (Citations: 46, h-index: 2)
Guobin Shen (Citations: 787, h-index: 16)
Feifei Zhao (Citations: 649, h-index: 14)
Sicheng Shen (Citations: 45, h-index: 3)
Han Shen (Citations: 91, h-index: 2)
Binghao Wang (Citations: 4, h-index: 1)
Zhou Yang (Citations: 8, h-index: 2)
Dongcheng Zhao (Citations: 1,260, h-index: 22)
Yi Zeng (Citations: 231, h-index: 9)

Original Abstract

The safety of large language models (LLMs) has increasingly emerged as a fundamental aspect of their development. Existing safety alignment for LLMs is predominantly achieved through post-training methods, which are computationally expensive and often fail to generalize well across different models. A small number of lightweight alignment approaches either rely heavily on prior-computed safety injections or depend excessively on the model's own capabilities, resulting in limited generalization and degraded efficiency and usability during generation. In this work, we propose a safety-aware decoding method that requires only low-cost training of an expert model and employs a single neuron as a gating mechanism. By effectively balancing the model's intrinsic capabilities with external guidance, our approach simultaneously preserves utility and enhances output safety. It demonstrates clear advantages in training overhead and generalization across model scales, offering a new perspective on lightweight alignment for the safe and practical deployment of large language models. Code: https://github.com/Beijing-AISI/NGSD.
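The abstract describes the mechanism only at a high level: a cheaply trained expert model supplies external guidance, and a single neuron gates how much of that guidance is used during decoding. The sketch below is one plausible reading of that idea, not the authors' implementation (see the linked repository for that). It assumes the gate is a sigmoid over one scalar activation and that the base and expert next-token logits are mixed linearly; the function names, the `weight` and `bias` values, and the mixing rule are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_next_token_probs(base_logits, expert_logits, neuron_activation,
                           weight=4.0, bias=-2.0):
    """Mix base and expert logits using a single scalar gate.

    Assumptions (not taken from the paper): the gate is
    sigmoid(weight * neuron_activation + bias), and the mixed logits are
    a convex combination of the two models' logits.
    """
    if len(base_logits) != len(expert_logits):
        raise ValueError("base and expert must share one vocabulary")
    gate = sigmoid(weight * neuron_activation + bias)
    mixed = [(1.0 - gate) * b + gate * e
             for b, e in zip(base_logits, expert_logits)]
    return gate, softmax(mixed)


# Toy three-token vocabulary: the base model prefers token 0,
# the hypothetical safety expert prefers token 2.
base = [2.0, 0.0, 0.0]
expert = [0.0, 0.0, 3.0]

low_gate, low_probs = gated_next_token_probs(base, expert, neuron_activation=-5.0)
high_gate, high_probs = gated_next_token_probs(base, expert, neuron_activation=5.0)
```

With a strongly negative activation the gate stays near 0 and the base model's preference is kept, which is how such a scheme could preserve utility on benign prompts. With a strongly positive activation the gate approaches 1 and the expert's preference dominates.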

0 Citations

0 Influential

31 Altmetric

155.0 Score
