2601.21864v1 Jan 29, 2026 cs.AI

KnowBias: Mitigating Social Bias in LLMs by Enhancing Bias-Knowledge Neurons

KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement

Jinhao Pan (Citations: 15, h-index: 2)

Chahat Raj (Citations: 91, h-index: 3)

A. Mukherjee (Citations: 136, h-index: 5)

Bowen Wei, George Mason University (Citations: 26, h-index: 3)

Shloka Yada (Citations: 0, h-index: 0)

Ziwei Zhu (Citations: 138, h-index: 5)

S. Mansouri (Citations: 8, h-index: 1)

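Large language models (LLMs) carry social biases that reinforce harmful stereotypes, which constrains their safe deployment. Most existing debiasing methods follow a suppressive paradigm, modifying the parameters, prompts, or neurons associated with biased behavior, but this often destabilizes the model, weakens generalization and data efficiency, and degrades general capability. This paper proposes KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, the neurons that hold bias knowledge. KnowBias identifies those neurons through attribution-based analysis using only a small set of bias-knowledge questions, then selectively enhances them at inference time. This design delivers strong debiasing while preserving general performance, and it generalizes across bias types and demographic groups. It is also highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs show that KnowBias consistently achieves state-of-the-art (SOTA) debiasing performance with minimal loss of utility. Data and code are available at https://github.com/JP-25/KnowBias.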

Original Abstract

Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm by modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias-knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias-knowledge questions via attribution-based analysis, and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographics, and is highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP-25/KnowBias.
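The identify-then-enhance idea in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the abstract does not specify the attribution method, the layers involved, or the scaling rule, so the sketch substitutes activation-times-gradient attribution on a tiny randomly initialized one-hidden-layer network, with random vectors standing in for encoded yes/no questions. The names `attribution_scores`, `top_k`, `boost`, and `alpha` are hypothetical. This is not the authors' implementation, which is available in the repository linked above.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def hidden_activations(W1, x):
    """Hidden layer of the toy network: relu(W1 @ x)."""
    return relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])


def forward(W1, w2, x, boost=None, alpha=1.0):
    """Return the output logit; neurons listed in `boost` are scaled by `alpha`."""
    h = hidden_activations(W1, x)
    if boost:
        h = [hj * alpha if j in boost else hj for j, hj in enumerate(h)]
    return sum(wj * hj for wj, hj in zip(w2, h))


def attribution_scores(W1, w2, questions):
    """Mean activation-times-gradient score per hidden neuron.

    The output is linear in the hidden layer, so the gradient of the
    logit with respect to neuron j is simply w2[j].
    """
    scores = [0.0] * len(w2)
    for x in questions:
        h = hidden_activations(W1, x)
        for j, hj in enumerate(h):
            scores[j] += hj * w2[j] / len(questions)
    return scores


def top_k(scores, k):
    """Indices of the k neurons with the highest attribution score."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


# Toy setup (assumed): random weights and random "question" vectors.
rng = random.Random(0)
n_in, n_hidden = 4, 8
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
questions = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(5)]

# Step 1: identify candidate neurons from a handful of questions.
scores = attribution_scores(W1, w2, questions)
selected = top_k(scores, k=2)

# Step 2: enhance only those neurons at inference time; weights are untouched.
x = questions[0]
baseline = forward(W1, w2, x)
enhanced = forward(W1, w2, x, boost=selected, alpha=2.0)
print("selected neurons:", sorted(selected))
print("baseline logit:", baseline)
print("enhanced logit:", enhanced)
```

The sketch keeps the two properties the abstract emphasizes: selection uses only a few inputs, and the intervention happens in the forward pass without retraining. In a real LLM the same pattern would presumably target neurons inside the transformer's layers, but which layers and how strongly to scale them is not stated in the abstract.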
