2602.02498v1 Jan 14, 2026 cs.CL

Test-Time Detoxification without Training or Learning Anything

Baturay Saglam

Yale University

Citations: 250

h-index: 9

Dionysis Kalogerias

Citations: 82

h-index: 4

Original Abstract

Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.
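The abstract does not say which zeroth-order estimator is used, so the following is only a minimal sketch of the general idea, assuming a standard two-point estimator with random Gaussian directions. The names `zo_gradient`, `zo_descent`, and `toy_toxicity` are hypothetical; in particular, `toy_toxicity` is a stand-in for the black-box pipeline "run the model forward from the perturbed input embeddings, then score the completion's toxicity", which is not reproduced here.

```python
import numpy as np


def zo_gradient(f, x, num_dirs=16, mu=1e-2, rng=None):
    """Estimate the gradient of a black-box scalar function f at x.

    Uses only function evaluations (no backpropagation): for each random
    direction u, a central finite difference along u approximates the
    directional derivative, which is then projected back onto u.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        directional = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        grad += directional * u
    return grad / num_dirs


def zo_descent(f, x0, steps=5, lr=0.1, **estimator_kwargs):
    """Take a small number of descent steps using the estimated gradient."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * zo_gradient(f, x, **estimator_kwargs)
    return x


def toy_toxicity(embeddings):
    """Hypothetical stand-in for 'generate a completion and score it'.

    A smooth quadratic, chosen only so the sketch runs without a model.
    """
    target = np.full_like(embeddings, 0.5)
    return float(np.mean((embeddings - target) ** 2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # assumed shape: 4 tokens, 8-dim embeddings
x1 = zo_descent(toy_toxicity, x0, steps=20, lr=0.5,
                num_dirs=32, mu=1e-2, rng=rng)

before, after = toy_toxicity(x0), toy_toxicity(x1)
print(f"score before: {before:.4f}, after: {after:.4f}")
```

Each gradient estimate costs `2 * num_dirs` evaluations of the scoring function, so with a real model the number of directions and descent steps directly sets the number of forward passes per prompt.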

2 Citations

0 Influential

4.5 Altmetric

24.5 Score
