2602.06440v1 Feb 06, 2026 cs.CL

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon (Citations: 11, h-index: 1)

Ruizhi Qian (Citations: 111, h-index: 2)

Minda Zhao (Citations: 14, h-index: 2)

Weiyue Li, Harvard University (Citations: 285, h-index: 3)

Mengyu Wang (Citations: 15, h-index: 2)

Abstract

Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
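The abstract does not say how the attention-based reweighting is computed. As a rough illustration only, and not the authors' implementation, the sketch below assumes each past turn is summarized by a fixed-length feature vector and that the current state supplies a query vector of the same length. It uses ordinary scaled dot-product attention to weight the turns and pool them into one summary vector. The names `softmax`, `reweight_history`, `query`, and `history`, and the toy numbers, are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_history(
    query: Sequence[float],
    history: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Weight past turns by scaled dot-product attention against `query`.

    Returns (weights, summary): one weight per turn, and the
    weight-averaged feature vector over all turns.
    Assumes every turn vector has the same length as `query`.
    """
    d = len(query)
    scores = [
        sum(q * h for q, h in zip(query, turn)) / math.sqrt(d)
        for turn in history
    ]
    weights = softmax(scores)
    summary = [
        sum(w * turn[i] for w, turn in zip(weights, history))
        for i in range(d)
    ]
    return weights, summary


# Toy data: three past turns, each a 2-dimensional feature vector.
history = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
query = [1.0, 1.0]
weights, summary = reweight_history(query, history)
```

In this toy run the second turn receives the largest weight because it aligns most closely with the query. A history-aware policy could then condition on `summary` rather than on the last turn alone. How the paper derives its features and feeds them to the RL agent is not stated in the abstract.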
