2601.11776v1 Jan 16, 2026 cs.CL

Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

Kaituo Zhang

Zhimeng Jiang

Na Zou

Original Abstract

Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention, all of which hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content and to refine the LLMs themselves, without external modules or data annotation. Specifically, we propose a Toxic Signal Detector, an internal self-identification mechanism, coupled with a systematic intervention process that transforms toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset that is used to fine-tune the model, enhancing its ability to generate safe and coherent text. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
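The detect, intervene, and collect loop described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Toxic Signal Detector or the intervention step work internally, so this sketch assumes both are plain prompts sent to the same model. Every name in it (`Generate`, `is_toxic`, `detoxify`, `build_pairs`, `toy_generate`) and the prompt wording are hypothetical, and the toy model is a keyword stub used only so the example runs without an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: any function that maps a prompt to a completion.
Generate = Callable[[str], str]

DETECT_PROMPT = "Is the following text toxic? Answer yes or no."
REWRITE_PROMPT = "Rewrite the text to be non-toxic, keeping its meaning."


@dataclass
class ContrastivePair:
    toxic: str
    detoxified: str


def is_toxic(generate: Generate, text: str) -> bool:
    """Self-identification step (assumed here to be a yes/no prompt)."""
    answer = generate(f"{DETECT_PROMPT}\n{text}")
    return answer.strip().lower().startswith("yes")


def detoxify(generate: Generate, text: str, max_rounds: int = 3) -> Optional[str]:
    """Rewrite until the model's own check passes; None if rounds run out."""
    current = text
    for _ in range(max_rounds):
        if not is_toxic(generate, current):
            return current
        current = generate(f"{REWRITE_PROMPT}\n{current}")
    return current if not is_toxic(generate, current) else None


def build_pairs(generate: Generate, texts: List[str]) -> List[ContrastivePair]:
    """Collect (toxic, detoxified) pairs that could later be used for fine-tuning."""
    pairs = []
    for text in texts:
        if not is_toxic(generate, text):
            continue  # already safe, so there is nothing to contrast
        clean = detoxify(generate, text)
        if clean is not None:
            pairs.append(ContrastivePair(text, clean))
    return pairs


def toy_generate(prompt: str) -> str:
    """Keyword stub standing in for an LLM; for illustration only."""
    instruction, _, text = prompt.partition("\n")
    if instruction == DETECT_PROMPT:
        return "yes" if "stupid" in text else "no"
    return text.replace("stupid", "unwise")


if __name__ == "__main__":
    for pair in build_pairs(toy_generate, ["That idea is stupid.", "Nice weather today."]):
        print(pair)
```

In this sketch the same callable plays detector and rewriter, which mirrors the abstract's claim that no external module is needed. The fine-tuning step on the resulting pairs is left out, since the abstract does not specify the training objective.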
