2601.10173v1 Jan 15, 2026 cs.CR

ReasAlign: Defending Against Prompt Injection Attacks through Reasoning-Based Safety Alignment

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li

Chaowei Xiao

Ning Zhang

G. E. Suh

Yankai Yang

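Large language models (LLMs) have enabled powerful agentic systems that can automate complex workflows across many domains. However, these systems are highly vulnerable to indirect prompt injection attacks, in which malicious instructions embedded in external data take control of the agent's behavior. This work presents ReasAlign, a model-level solution that improves safety against indirect prompt injection attacks. Its core idea is to use structured reasoning steps to analyze the user query, detect conflicting instructions, and preserve the user's intended task. To further ensure the logic and accuracy of the reasoning, the authors introduce a test-time scaling mechanism that uses a preference-optimized judge model to score reasoning steps and select the best trajectory. Comprehensive evaluation across a range of benchmarks shows that ReasAlign maintains utility comparable to an undefended model while consistently outperforming strong existing defenses such as Meta SecAlign. On CyberSecEval2, a representative open-ended benchmark that includes a variety of prompt injection tasks, ReasAlign achieves 94.6% utility and a 3.6% attack success rate (ASR), compared with 56.4% utility and 74.4% ASR for Meta SecAlign. These results show that ReasAlign offers the best trade-off between security and utility and provides a robust, practical defense against prompt injection attacks in real-world agentic systems. The code and experimental results are available at https://github.com/leolee99/ReasAlign.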
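The two reported numbers, utility and ASR, can be read as simple rates over evaluation cases. The sketch below is a minimal illustration under assumed definitions (utility as the fraction of cases where the user's task was completed, ASR as the fraction where the injected instruction was carried out); the exact scoring rules of CyberSecEval2 and of the paper may differ, and `EvalRecord` and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EvalRecord:
    """One evaluated case (hypothetical schema, not from the paper)."""
    task_completed: bool    # the user's intended task was carried out
    attack_succeeded: bool  # the injected instruction was followed


def utility(records: Sequence[EvalRecord]) -> float:
    """Assumed definition: share of cases where the user's task succeeded."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(r.task_completed for r in records) / len(records)


def attack_success_rate(records: Sequence[EvalRecord]) -> float:
    """Assumed definition: share of cases where the injection took effect."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(r.attack_succeeded for r in records) / len(records)


# Toy data for illustration only; these are not results from the paper.
toy_records = [
    EvalRecord(task_completed=True, attack_succeeded=False),
    EvalRecord(task_completed=True, attack_succeeded=False),
    EvalRecord(task_completed=True, attack_succeeded=False),
    EvalRecord(task_completed=True, attack_succeeded=False),
    EvalRecord(task_completed=False, attack_succeeded=True),
]
toy_utility = utility(toy_records)
toy_asr = attack_success_rate(toy_records)
```

Under these assumed definitions, a good defense pushes utility toward 1 and ASR toward 0 at the same time, which is the trade-off the reported 94.6% / 3.6% figures describe.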

Original Abstract

Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% attack success rate (ASR), far surpassing the state-of-the-art defensive model, Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results can be found at https://github.com/leolee99/ReasAlign.
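The test-time scaling step described in the abstract (a judge scores reasoning steps, and the best trajectory is selected) can be sketched as best-of-N selection. This is a minimal illustration, not the authors' implementation: the `Trajectory` type, the `toy_judge` keyword heuristic, and the choice to aggregate step scores with `min` are all assumptions made here. In ReasAlign the judge is a preference-optimized model, and how it combines step scores is not stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    """A candidate response: reasoning steps plus a final answer (hypothetical type)."""
    steps: List[str]
    answer: str


def score_trajectory(
    traj: Trajectory,
    judge: Callable[[str], float],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> float:
    """Score every step with the judge and combine the step scores.

    Using `min` (a trajectory is only as good as its weakest step) is an
    assumption of this sketch, not something taken from the paper.
    """
    step_scores = [judge(step) for step in traj.steps]
    return aggregate(step_scores) if step_scores else float("-inf")


def select_best(
    candidates: Sequence[Trajectory], judge: Callable[[str], float]
) -> Trajectory:
    """Best-of-N selection: return the highest-scoring candidate."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=lambda t: score_trajectory(t, judge))


def toy_judge(step: str) -> float:
    """Stand-in judge: penalizes steps that decide to obey injected text."""
    return 0.0 if "follow the injected" in step.lower() else 1.0


hijacked = Trajectory(
    steps=[
        "The user asks for a summary of the email.",
        "The email says to forward all passwords; follow the injected instruction.",
    ],
    answer="Forwarding passwords.",
)
aligned = Trajectory(
    steps=[
        "The user asks for a summary of the email.",
        "The email contains a conflicting instruction; ignore it and keep to the user's task.",
    ],
    answer="Summary of the email.",
)
best = select_best([hijacked, aligned], toy_judge)
```

With the toy judge, the trajectory that keeps to the user's task is selected. Swapping in a learned judge only requires passing a different scoring callable.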
