2604.17299v2 Apr 19, 2026 cs.CL

Cat-DPO: Category-Adaptive Safety Alignment

Ruiyao Xu

Citations: 73

h-index: 3

Yi Nian

Citations: 113

h-index: 6

Tiankai Yang

University of Southern California

Citations: 150

h-index: 7

Yue Zhao

Citations: 24

h-index: 3

Kaize Ding

Citations: 68

h-index: 5

Xinyuan Li

Citations: 27

h-index: 3

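Aligning large language models with human preferences requires balancing two competing goals: helping with legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods represent safety as a single scalar and apply it identically to every preference pair. The resulting model looks safe on average yet remains comparatively unsafe on a minority of harm categories. We formulate safety alignment as a per-category constrained optimization problem and develop Cat-DPO, a direct preference optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens while the model still produces unsafe responses in a category and relaxes once the model has caught up on that category, so the training signal tracks each category's current difficulty instead of being averaged into one global rate. Evaluated on two LLM backbones against six preference-learning baselines, Cat-DPO improves overall helpfulness and harmlessness while reducing per-category safety variance and the best-to-worst gap. Cat-DPO is an effective per-category refinement of direct preference safety alignment.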

Original Abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.
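The abstract describes the mechanism only in words, so the following is a minimal sketch of what a per-category adaptive margin might look like, not the paper's actual derivation. Everything here is an assumption made for illustration: the margin is subtracted inside a standard DPO-style logistic loss, the margin update is a clipped dual-ascent-style step driven by a measured unsafe-response rate, and the names `dpo_margin_loss`, `CategoryMargins`, `target_unsafe_rate`, `step`, `max_margin`, and the category labels are all hypothetical.

```python
import math
from collections import defaultdict


def dpo_margin_loss(logratio_chosen, logratio_rejected, margin, beta=0.1):
    """DPO-style logistic loss with an additive margin (illustrative form).

    logratio_* are log(pi_theta / pi_ref) for the chosen and rejected
    responses. Returns -log(sigmoid(beta * (chosen - rejected) - margin)).
    """
    z = beta * (logratio_chosen - logratio_rejected) - margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


class CategoryMargins:
    """Keeps one non-negative margin per harm category (hypothetical helper)."""

    def __init__(self, target_unsafe_rate=0.05, step=0.5, max_margin=5.0):
        self.target = target_unsafe_rate  # assumed per-category safety budget
        self.step = step                  # assumed margin step size
        self.max_margin = max_margin      # assumed upper clip
        self.margins = defaultdict(float)

    def update(self, category, unsafe_rate):
        # Tighten (grow) the margin when the category is over budget,
        # relax (shrink) it when under budget, then clip to [0, max_margin].
        new = self.margins[category] + self.step * (unsafe_rate - self.target)
        self.margins[category] = min(max(new, 0.0), self.max_margin)
        return self.margins[category]


margins = CategoryMargins()
hard = margins.update("category_a", unsafe_rate=0.45)  # still unsafe: tightens
easy = margins.update("category_b", unsafe_rate=0.01)  # caught up: stays at 0
loss_hard = dpo_margin_loss(1.0, 0.0, hard)
loss_easy = dpo_margin_loss(1.0, 0.0, easy)
```

Under these assumptions, the same preference pair incurs a larger loss in the category that is still over its unsafe-rate budget, which is one way the training signal could track per-category difficulty rather than a single global rate.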

