2605.05058v1 May 06, 2026 cs.CR

SoK: Robustness in Large Language Models against Jailbreak Attacks

B. B. Zhu (citations: 38, h-index: 4)

Shi-Feng Sun (citations: 25, h-index: 1)

Fei Xu (citations: 2, h-index: 1)

Hongsheng Hu (citations: 134, h-index: 4)

Chaoxiang He (citations: 55, h-index: 4)

Shengqi Hang (citations: 10, h-index: 2)

Hanqing Hu (citations: 13, h-index: 1)

Xiuming Liu (citations: 73, h-index: 6)

Yubo Zhao (citations: 7, h-index: 2)

Zhengyan Zhou (citations: 2, h-index: 1)

Dawu Gu (citations: 48, h-index: 4)

Shuo Wang (citations: 10, h-index: 2)

Abstract

Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across the literature. Leveraging Security Cube, we conduct benchmark studies on 13 representative attacks and 5 defenses, establishing a clear view of the current landscape encompassing jailbreak attacks, defenses, automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis aims to pave the way towards more robust, interpretable, and trustworthy LLM systems. Our code is available at Code.
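For readers unfamiliar with the attack success rate (ASR) metric that the abstract calls too narrow, here is a minimal sketch of how it is commonly computed: the fraction of adversarial prompts that a judge labels as successful jailbreaks, grouped here by attack and defense. The record fields (`attack`, `defense`, `jailbroken`) and the sample data are hypothetical and illustrative only. They are not taken from the paper or its Security Cube code.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return the attack success rate for each (attack, defense) pair.

    Each record is a dict with hypothetical keys:
      'attack'     - name of the attack method
      'defense'    - name of the defense in place
      'jailbroken' - bool verdict from some judge (assumed given)
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        key = (record["attack"], record["defense"])
        totals[key] += 1
        successes[key] += int(record["jailbroken"])
    # ASR = successful jailbreaks / total attempts for that pair
    return {key: successes[key] / totals[key] for key in totals}


# Illustrative, made-up verdicts (not results from the paper)
records = [
    {"attack": "A1", "defense": "none", "jailbroken": True},
    {"attack": "A1", "defense": "none", "jailbroken": False},
    {"attack": "A1", "defense": "D1", "jailbroken": False},
    {"attack": "A1", "defense": "D1", "jailbroken": False},
]
asr = attack_success_rate(records)
```

A single ratio like this says nothing about cost, transferability, or judge reliability, which is the kind of gap the abstract points to when it argues for a multi-dimensional evaluation.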

