2604.17769v1 Apr 20, 2026 cs.CL

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Aimin Zhou

Yuan Fang

Fei Tan

Yiming Luo

Original Abstract

Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
