2604.18789v1 Apr 20, 2026 cs.AI

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Charith Peris

Amazon

Jiacheng Liang

Yaoyao Ma

Satyapriya Krishna

Rahul Gupta

Kai-Wei Chang

A.G. Galstyan

Tharindu Kumarage

Arizona State University

Abstract

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses, that is, cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the discovered vulnerabilities, ARES implements a two-stage repair process: it first fine-tunes the RM to better detect harmful content, then leverages the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.
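The dual-targeting idea in the abstract can be illustrated with a toy sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the names `Probe`, `Failure`, `compose_prompts`, `classify`, and `rm_repair_pairs` are hypothetical, the component pools are placeholders, and a plain Cartesian product stands in for the LLM-based Safety Mentor. The policy and reward model are stub functions, and an RM failure is taken to mean that the RM scores the malicious response at least as high as the safe one.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple

# The four component types named in the abstract.
COMPONENT_TYPES = ("topic", "persona", "tactic", "goal")


@dataclass(frozen=True)
class Probe:
    """One adversarial prompt with its paired safe and malicious responses."""
    prompt: str
    safe_response: str
    malicious_response: str


class Failure(Enum):
    NONE = "none"
    POLICY_ONLY = "policy_only"
    RM_ONLY = "rm_only"
    SYSTEMIC = "systemic"  # policy and RM fail on the same probe


def compose_prompts(pools: Dict[str, List[str]]) -> Iterator[str]:
    """Yield one prompt skeleton per combination of the component types.

    Assumption: exhaustive enumeration replaces the Safety Mentor's
    dynamic, semantically coherent composition.
    """
    for combo in product(*(pools[t] for t in COMPONENT_TYPES)):
        yield "; ".join(f"{t}={v}" for t, v in zip(COMPONENT_TYPES, combo))


def rm_misranks(probe: Probe, reward: Callable[[str, str], float]) -> bool:
    """True if the RM does not prefer the safe response (assumed criterion)."""
    return (reward(probe.prompt, probe.malicious_response)
            >= reward(probe.prompt, probe.safe_response))


def classify(probe: Probe,
             policy_is_unsafe: Callable[[str], bool],
             reward: Callable[[str, str], float]) -> Failure:
    """Label a probe by which part of the policy-reward system fails on it."""
    policy_fails = policy_is_unsafe(probe.prompt)
    rm_fails = rm_misranks(probe, reward)
    if policy_fails and rm_fails:
        return Failure.SYSTEMIC
    if policy_fails:
        return Failure.POLICY_ONLY
    if rm_fails:
        return Failure.RM_ONLY
    return Failure.NONE


def rm_repair_pairs(probes: List[Probe],
                    reward: Callable[[str, str], float]
                    ) -> List[Tuple[str, str, str]]:
    """Stage-1 data (assumed format): (prompt, chosen, rejected) triples."""
    return [(p.prompt, p.safe_response, p.malicious_response)
            for p in probes if rm_misranks(p, reward)]


# Placeholder pools and stub models, for illustration only.
pools = {"topic": ["T1", "T2"], "persona": ["P1"],
         "tactic": ["K1", "K2"], "goal": ["G1"]}
probes = [Probe(p, "SAFE", "UNSAFE") for p in compose_prompts(pools)]


def policy_is_unsafe(prompt: str) -> bool:
    # Stub: the policy is fooled whenever tactic K2 is used.
    return "tactic=K2" in prompt


def reward(prompt: str, response: str) -> float:
    # Stub: the RM ranks correctly except on topic T2, where it is inverted.
    correct = response == "SAFE"
    return 1.0 if correct != ("topic=T2" in prompt) else 0.0


labels = [classify(p, policy_is_unsafe, reward) for p in probes]
pairs = rm_repair_pairs(probes, reward)
```

With these stubs, the four composed probes cover all four outcomes, and only the probes the RM mis-ranks feed the first repair stage. The second stage, optimizing the core model against the repaired RM, depends on the RL training stack and is not sketched here.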
