2603.02262v1 Feb 28, 2026 cs.CR

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Wenjie Wang (Citations: 37, h-index: 4)

Jiandong Gao (Citations: 56, h-index: 4)

Ji Wu (Citations: 4, h-index: 2)

Jingyuan Xie (Citations: 4, h-index: 1)

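Supervised fine-tuning (SFT) is essential to developing medical large language models (LLMs), but prior attack research has focused mainly on detectable backdoor attacks. This study proposes a new attack on the reasoning process of medical LLMs during SFT. Unlike conventional backdoor attacks, the method injects poisoned rationales into few-shot training data, covertly degrading model performance on targeted medical topics. In the experiments, knowledge overwriting was ineffective, whereas rationale poisoning substantially reduced accuracy on the target subject. An effective and stealthy attack required a minimum number and ratio of poisoned samples, and it was more efficient and precise than catastrophic forgetting. By demonstrating the risk of attacks at the SFT stage, the study aims to encourage defense research in the sensitive medical domain.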

Original Abstract

Supervised fine-tuning (SFT) is essential for the development of medical large language models (LLMs), yet prior poisoning studies have mainly focused on detectable backdoor attacks. We propose a novel poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model performance on targeted medical topics. Results showed that knowledge overwriting was ineffective, while rationale poisoning caused a significant decline in accuracy on the target subject, as long as no correct samples of the same subject appeared in the dataset. A minimum number and ratio of poisoned samples were needed to carry out an effective and stealthy attack, which was more efficient and accurate than catastrophic forgetting. Through this study, we demonstrate the risk of SFT-stage poisoning, hoping to spur more research on defenses in the sensitive medical domain.
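The abstract describes a degradation confined to one target subject, which an aggregate accuracy figure can hide. As a minimal sketch of how that effect could be measured, the Python below compares per-subject accuracy before and after fine-tuning. The function names, the `(subject, is_correct)` record format, and the toy values are assumptions made for illustration; none of them come from the paper, and the numbers are not its results.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return {subject: accuracy} from (subject, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, is_correct in records:
        totals[subject] += 1
        hits[subject] += int(is_correct)
    return {subject: hits[subject] / totals[subject] for subject in totals}


def subject_drops(before, after):
    """Accuracy change (after - before) for subjects present in both runs."""
    return {s: after[s] - before[s] for s in before if s in after}


# Toy, hypothetical evaluation records (not data from the paper).
before_ft = [
    ("cardiology", True), ("cardiology", True),
    ("neurology", True), ("neurology", False),
]
after_ft = [
    ("cardiology", False), ("cardiology", True),
    ("neurology", True), ("neurology", False),
]

drops = subject_drops(
    accuracy_by_subject(before_ft),
    accuracy_by_subject(after_ft),
)

# A large negative change on one subject, with the others unchanged,
# is the pattern a per-subject breakdown is meant to surface.
for subject, delta in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{subject}: {delta:+.2f}")
```

Reporting the change per subject rather than overall is the design choice here: a targeted decline on a single topic can leave the overall score nearly flat.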

0 Citations

0 Influential

2 Altmetric

10.0 Score
