2605.28030v1 May 27, 2026 cs.LG

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

James T. Kwok (citations: 2,318; h-index: 13)
Shuhao Chen (citations: 129; h-index: 4)
Weisen Jiang, HKUST (citations: 1,154; h-index: 12)
Yu Zhang (citations: 990; h-index: 9)
Yeqi Gong (citations: 3; h-index: 1)
Chengxiang Zhuo (citations: 109; h-index: 5)
Zang Li (citations: 28; h-index: 3)
Shengda Luo (citations: 49; h-index: 3)

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks, in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections, using a set of safe data to enforce safety constraints. To curate this safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact safe subset, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
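The abstract does not give the kernel or the selection procedure, so the following is only a minimal sketch of how a relevance-diversity determinantal point process could pick a compact subset. It assumes the standard quality-diversity kernel L = diag(r) S diag(r), where r holds per-example relevance scores and S is cosine similarity between feature vectors, and it uses naive greedy log-determinant maximization. The function names `build_kernel` and `greedy_dpp_select`, the feature vectors, and the relevance scores are all hypothetical and are not taken from the SPARD code.

```python
import numpy as np


def build_kernel(features, relevance):
    """Assumed quality-diversity kernel: L = diag(r) @ S @ diag(r)."""
    # Cosine similarity between row-normalized feature vectors.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    return relevance[:, None] * similarity * relevance[None, :]


def greedy_dpp_select(kernel, k):
    """Greedily add the item that maximizes log det of the selected submatrix."""
    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # Skip candidates that make the submatrix singular or non-positive.
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            break  # no candidate keeps the determinant positive
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Toy data (hypothetical): items 0 and 1 are near-duplicates, item 2 is different.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
relevance = np.array([1.0, 0.9, 0.5])
chosen = greedy_dpp_select(build_kernel(features, relevance), k=2)
print(chosen)
```

In this toy case the most relevant item is picked first, and the second pick favors the dissimilar item over the more relevant near-duplicate, which is the relevance-versus-coverage trade-off the abstract describes. The naive loop recomputes a determinant per candidate and is meant for illustration, not for large candidate pools.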

3 Citations
1 Influential
31.993061443341 Altmetric
165.0 Score
