2602.13427v1 Feb 13, 2026 cs.CR

Backdooring Bias in Large Language Models

Anudeep Das

Citations: 26

h-index: 2

Prach Chantasantitam

Citations: 4

h-index: 1

Gurjot Singh

Citations: 63

h-index: 4

M. Ponomarenko

Citations: 3

h-index: 1

Lipeng He

Citations: 34

h-index: 3

Florian Kerschbaum

Citations: 155

h-index: 6

Original Abstract

Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison data, and to manipulate the poisoned data, is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor removal, are able to mitigate these attacks. Our analysis reveals numerous new findings. We discover that both syntactically- and semantically-triggered attacks can effectively induce the target behaviour and largely preserve utility, that semantically-triggered attacks are generally more effective in inducing negative biases, and that both backdoor types struggle with causing positive biases. Furthermore, while both defense types are able to mitigate these backdoors, they either result in a substantial drop in utility or require high computational overhead.
