2604.12232v1 Apr 14, 2026 cs.CR

TEMPLATEFUZZ: LLM의 제어 우회 및 레드 팀 테스트를 위한 정밀한 챗 템플릿 퍼징

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Zibo Xiao

Citations: 36

h-index: 2

En-Pei Hu

Citations: 224

h-index: 6

Qingchao Shen

Citations: 429

h-index: 7

Lili Huang

Citations: 16

h-index: 2

Yongqiang Tian

Citations: 66

h-index: 5

Junjie Chen

Citations: 5,730

h-index: 43

대규모 언어 모델(LLM)은 다양한 분야에 점점 더 많이 활용되고 있지만, 악의적인 입력이 안전 장치를 우회하여 유해한 결과를 생성하는 제어 우회 공격에 취약하다는 점은 심각한 보안 위험을 초래합니다. 기존 연구는 주로 프롬프트 주입 공격에 초점을 맞추었지만, 이러한 접근 방식은 종종 많은 리소스를 필요로 하며 챗 템플릿과 같은 중요한 구성 요소를 간과합니다. 본 논문에서는 챗 템플릿의 취약점을 체계적으로 파악하는 정밀 퍼징 프레임워크인 TEMPLATEFUZZ를 소개합니다. TEMPLATEFUZZ는 (1) 다양한 챗 템플릿 변형을 생성하기 위한 요소 수준의 변이 규칙을 설계하고, (2) 공격 성공률(ASR)을 높이는 방향으로 챗 템플릿 생성을 안내하는 휴리스틱 검색 전략을 제안하며, (3) 정확하고 효율적인 제어 우회 평가를 위한 경량 규칙 기반 오라클을 도출하기 위해 능동 학습 기반 전략을 통합합니다. TEMPLATEFUZZ는 다양한 공격 시나리오에서 12개의 오픈 소스 LLM을 대상으로 평가되었으며, 평균 ASR은 98.2%로, 정확도 저하율은 1.1%에 불과했습니다. 이는 최첨단 방법보다 ASR에서 9.1%~47.9% 향상되고 정확도 저하율에서 8.4% 향상된 결과입니다. 또한, 챗 템플릿을 지정할 수 없는 5개의 선도적인 상업용 LLM에서도 TEMPLATEFUZZ는 챗 템플릿 기반 프롬프트 주입 공격을 통해 평균 90%의 ASR을 달성했습니다.

Original Abstract

Large Language Models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. While prior work has primarily focused on prompt injection attacks, these approaches often require resource-intensive prompt engineering and overlook other critical components, such as chat templates. This paper introduces TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically exposes vulnerabilities in chat templates, a critical yet underexplored attack surface in LLMs. Specifically, TEMPLATEFUZZ (1) designs a series of element-level mutation rules to generate diverse chat template variants, (2) proposes a heuristic search strategy to guide the chat template generation toward the direction of amplifying the attack success rate (ASR) while preserving model accuracy, and (3) integrates an active learning-based strategy to derive a lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Evaluated on twelve open-source LLMs across multiple attack scenarios, TEMPLATEFUZZ achieves an average ASR of 98.2% with only 1.1% accuracy degradation, outperforming state-of-the-art methods by 9.1%-47.9% in ASR and 8.4% in accuracy degradation. Moreover, even on five industry-leading commercial LLMs where chat templates cannot be specified, TEMPLATEFUZZ attains a 90% average ASR via chat template-based prompt injection attacks.

0 Citations

0 Influential

21.5 Altmetric

107.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!